ABSTRACT

This report describes the development and validation 
of the Test of Inference Ability in Reading Comprehension: a 
scaled-answer, multiple-choice test intended for use in Grades 6, 7, 
and 8. The report discusses the need for and conceptualization of 
assessment of inference ability; proposes standards and principles of
inference appraisal; and discusses test design issues, specifically
audience, kinds of discourse, topic familiarity, readability, test
format, test length, and passage and item development. Five pilot 
studies are then presented to show test evolution, providing details 
of the modifications at each phase of test development. The fifth 
pilot study (focusing on test validation) is discussed, involving 
verbal reports of students' thinking as they worked through the test, 
and providing a reading score for the answer selected and a 
corresponding thinking score for a student's explanation of why that 
answer was chosen. Results of this fifth pilot study indicate that 
for 94% of the items good thinking was significantly correlated with
good inference-making and poor thinking with poor inference-making. The
report concludes by presenting the final data collection, analyses,
results, and directions for future research. (Thirteen tables of data 
are included, and 98 references are attached. Two appendixes 
providing the reading and thinking rating scales for the test 
conclude the report.) (SR)

Abstract

This report describes the development and validation of assessments of inference ability in reading 
comprehension in the middle grades. The need for and conceptualization of assessments of inference 
ability are raised and discussed within a contemporary theoretical framework. Standards and principles 
of inference appraisal are proposed. The specifications of the assessment address issues such as 
audience, kinds of discourse, topic familiarity, readability, test format, test length, and passage and item
development. The five pilot studies are presented to show test evolution. The fifth pilot study focused
on test validation. Verbal reports of students' thinking as they worked through the test were
collected to provide an independent index of the quality of their thinking as they selected their 
responses. Thus, for each item there was a reading score for the answer selected and a corresponding 
thinking score for a student's explanation of why that answer was chosen. For 94% of the items good
thinking was significantly correlated with good inference-making and poor thinking with poor
inference-making. The final data collection, analyses, and results are presented, followed by a discussion of
directions for future research.

DEVELOPING AND VALIDATING ASSESSMENTS OF
INFERENCE ABILITY IN READING COMPREHENSION 



This report describes a model for the development and validation of a test of inference ability in 
reading comprehension. The assumption that the ability to make inferences is necessary to reading 
comprehension is widely accepted by reading theorists and researchers. This observation, coupled with 
the fact that no satisfactory procedures exist for determining the extent to which children make good
inferences when attempting to understand text, motivated this project. This report is built around four
sections. The first section offers a contemporary theoretical framework for a test of inference
ability in reading comprehension. The design, item development, and test development iterations are
described in the three subsequent sections.

The Need for and Conceptualization of a Test 
of Inference Ability in Reading Comprehension 

Evidence abounds to suggest that poor reasoning is prevalent in our students. The National 
Assessment of Educational Progress (1984) reported large decreases in inferencing responses of 13-
and 17-year-olds over a 10-year period. Furthermore, the Nation's Report Card (Applebee, Langer, &
Mullis, 1987) reports that only small percentages of students can reason effectively as they read and
write. Such findings suggest that students are not being guided to perform reasoning activities which
require analysis and interpretation. 

These findings are not surprising when coupled with research on teaching practices. In the late 1970's 
Dolores Durkin (1978-79) reported that schools do not teach comprehension, and precious little time
was devoted to having students explain and substantiate their interpretations. The authors of the
Report of the Commission on Reading (Anderson, Hiebert, Scott, & Wilkinson, 1985) lamented that
there is very little direct comprehension instruction in most American classrooms. The increased
evidence of poor reasoning has led to claims about deficiencies in school programs and to appeals for
action, such as more testing. Yet, most available tests are general and vague about the nature of 
reading comprehension and do not support instructional improvement. It would make sense that a test 
designed to specifically measure inference ability and to support instructional improvement would be 
an important place to start. 

In my study of the literature on testing for reading comprehension, and particularly on testing for
inference ability, standardized tests of reading were found problematic because it is not easy to
decide what the tests measure. Prominent researchers in the field declare that reading assessment has a
number of substantial flaws. Two pieces of research illustrate some of these flaws.

Tuinman (1973-74) investigated five widely used standardized tests according to the extent to which 
questions on the tests could be answered without reading the passages upon which those questions 
were based. Two points emerged. First, although Tuinman was cautious about his conclusions, he
found significant reason to believe that it was general knowledge and not reading comprehension which
was being measured. He suggested that more exploration was needed to discover the extent of this test
invalidity. Second, these five major tests did not provide any technical information on the extent to
which items could be answered on the basis of information other than that present in the passage. That
is, the major tests have failed to address this significant construct validity issue.
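
The kind of passage-independence check that Tuinman describes can be illustrated with a minimal sketch. The Python example below is not taken from Tuinman's study. It assumes hypothetical response data from students who answered one multiple-choice item without seeing its passage, and it compares the proportion answering correctly with the chance level for that item. The function name `passage_independence` and the data are illustrative assumptions only.

```python
def passage_independence(correct_without_passage, n_options):
    """Summarize how answerable one item is WITHOUT its passage.

    correct_without_passage: list of booleans, one per student who
        attempted the item without reading the passage (hypothetical data).
    n_options: number of answer choices offered by the item.

    Returns (proportion correct, chance level, excess over chance).
    """
    if not correct_without_passage:
        raise ValueError("need at least one response")
    if n_options < 2:
        raise ValueError("a multiple-choice item needs at least two options")
    proportion = sum(correct_without_passage) / len(correct_without_passage)
    chance = 1 / n_options
    return proportion, chance, proportion - chance


# Hypothetical responses from ten students to a four-option item.
responses = [True, True, False, True, True, True, False, True, True, False]
proportion, chance, excess = passage_independence(responses, n_options=4)
print(f"correct without passage: {proportion:.2f}; chance level: {chance:.2f}")
```

An item whose no-passage proportion sits well above chance may be drawing on general knowledge rather than on comprehension of the passage. A real analysis would also need a significance test and an adequate sample, which this sketch omits.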

The question of validity is a major concern because the predominant approach to construct validity in 
standardized reading tests, that of correlations with other tests, has a significant weakness. It is based
on rather circular reasoning because two tests may be intended to measure the same ability, be highly
correlated, and still fail to measure what they were intended to measure. That is, they can possess what
Embretson (Whitely) (1983) has called "nomothetic span," but still fail to be representative of the
construct they are proposed to measure.

Many of the problems with standardized reading tests remain unresolved and by continuing to use 
poorly produced tests we are not recognizing the concerns raised. Anderson (1972) reported that
educational researchers have not yet learned to develop achievement tests that meet the primitive first
requirement for a system of measurement, namely that there is a clear and consistent definition of the
things being counted. A search for definitions of reading comprehension in the major established
reading tests (many of which were developed prior to the 1970's) reveals that for all intents and
purposes none exist. Generally, each of the manuals says nothing more than that the test items
measure specific skills related to the comprehension of what is stated explicitly in the material, the
judgment of what is implied, and the drawing of inferences with reference to other situations. The tests
do not identify which specific skills are being measured by particular items, nor do the tests report
separate scores for specific skills. Rather, a composite score of comprehension, vaguely defined, is
reported. 

Much work is required to give rise to orderly, sensible data in the evaluation of reading comprehension
in general and inference ability in particular. It is my opinion that as a start much can be learned and
applied to reading from researchers in the field of critical thinking, who have attempted to elaborate
and clearly define the nature of inference (Govier, 1985; Hitchcock, 1983; Norris, 1984; Salmon, 1984;
Scriven, 1976). In particular, I refer to the extensive work of Robert Ennis (1962, 1969, 1981, 1985).
His efforts in characterizing rational thinkers have been, and continue to be, an invaluable source of
ideas for the project, as were available critical thinking tests (Ennis & Millman, 1985; Watson &
Glaser, 1980) and a test of induction currently under development (Norris & Ryan, 1987) which include
sections testing for inference-making ability. 

Defining Reading Comprehension

As might be expected, reading comprehension has been defined in many ways. Articles and books 
(Anderson & Pearson, 1984; Carroll, 1964; Farnham, 1905; Goodman, 1968; Gray, 1940; Huey, 1908;
Plato, 1917; Richards, 1938; Thorndike, 1917) have been written on and about reading comprehension.
While each article and book contributes to a more thorough understanding of reading comprehension, 
each is incomplete. It is beyond the province of this project to attempt to provide a complete and 
comprehensive definition of such a complex act as reading comprehension. For the purposes of this 
project, I preferred to remain tentative and prepared to alter my working definition of reading 
comprehension with subsequent information. 

Reading comprehension is believed to be a collection of processes such as predicting, inferring, 
synthesizing, generalizing, and monitoring, which have been identified and labelled in various ways by 
different writers in the field (Collins, Brown, & Larkin, 1980; Fagan, 1987; Henry, 1974; Phillips, 1988; 
Smith, 1971). It is widely accepted that reading comprehension involves more than knowing the correct 
pronunciation of the words, knowing their individual meanings, and being able to locate information in
printed material (Norris & Phillips, 1987; Phillips & Norris, 1987; Spiro, 1977; Tuinman, 1986).
Current reading theory defines reading comprehension, more or less, as meaning constructed by a
reader through strategic and principled integration of the textual information and background
knowledge.

Since an explanation of the intricacies of reading comprehension remains elusive, and since it is agreed 
that reading comprehension is a complex behavior which continues to be perplexing, then one cannot 
set about assessing it in its entirety. Thus, it seems to make sense to study specific aspects of the 
process as a means of seeking advancements in the assessment of the complete process of reading 
comprehension. It is with the process of inference as an aspect of reading comprehension that I am 
most concerned.

At a general level, inference is a cognitive process used to construct meaning. Inferring in reading
comprehension is a constructive thinking process, because a reader expands knowledge by proposing 
and evaluating hypotheses about the meaning of text. 

Good inference-making in reading comprehension requires the thoughtful use of strategies (Collins,
Brown, & Larkin, 1980; Phillips, 1985, 1987, 1988; van Dijk & Kintsch, 1983) and evaluative criteria.
Inferences in reading comprehension tend to be good to the extent that readers integrate relevant text 
information and background knowledge to construct complete interpretations that are consistent with 
both the text information and background knowledge. 

At a specific level, inference requires intelligent human judgment (Ennis, 1973), and necessitates the 
use of relevant text information and background knowledge. This dependence on background 
knowledge is important for at least three reasons. First, an inference in reading comprehension is the 
interaction of relevant information provided in the text and background knowledge. In other words,
neither textual information nor background knowledge alone is sufficient to make good inferences.
Second, background knowledge enables the generation of alternative hypotheses in inferring.
Inference is the basis of understanding which often involves transforming, extending, and relating 
information (Markman, 1981). Third, without background knowledge one cannot evaluate the strength 
of inferences to generalizations and explanations (Govier, 1985), thereby making background 
knowledge a necessary part of inferential reasoning. 

The Objectives of the Test

Having implied the complexity of the reading comprehension process in general, and having described 
the process of inference as one aspect of comprehension in particular, it is important to reiterate that 
comprehension is a complicated cognitive process. Indeed, there may be considerable overlap and
interdependence among inferring and the other comprehension processes of attending, analyzing,
associating, predicting, synthesizing, generalizing, and monitoring. A general test of comprehension
ability would focus on each of the processes, whereas the Test of Inference Ability in Reading
Comprehension (TIA) (Phillips & Patterson, 1987) focuses specifically on the process of
inference-making.

TIA is designed to appraise the inference ability of middle grade students on the basis of full length 
passages representative of the three kinds of discourse commonly found at the middle grade levels and 
of topics characteristic of classroom reading materials. TIA is designed to inform teachers about 
students' inference ability and to provide diagnostic information for instructional decision-making
purposes. 



A Principle of Inference Appraisal 

The general guideline of ability test validation that directed this research was that the test would be 
valid to the extent that good inference-making leads to good performance on the test and that poor
inference-making leads to poor performance. To be in a position to distinguish good inference-making 
from poor inference-making implies that there must be standards for making such distinctions. 
Reading educators should not accept just any inference merely because it reflects some level of the 
reader's cognitive competence. When we judge someone's inference to be normatively good we are
comparing it to what we take to be some standard of expert competence. So, it is important that the
best interpretations are inferences in accord with the best available principles. To be in a position to
improve reasoning means to be in a position to distinguish good reasoning from bad. To do so implies
that there must be principles and standards. 
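
The validation guideline above can be illustrated with a minimal sketch, assuming that each student receives, for a given item, a reading score for the answer selected and a thinking score for the explanation offered. The Python example below computes a Pearson correlation between the two sets of scores for one item. The rating values, the 0 to 3 range, and the function name `pearson_r` are illustrative assumptions, not the rating scales or the analysis procedure used for the test itself.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least two scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one set of scores does not vary")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical ratings for ONE item; each position is one student.
reading_scores = [3, 2, 0, 3, 1, 2]   # score for the answer selected
thinking_scores = [3, 2, 1, 3, 0, 2]  # score for the explanation given
r = pearson_r(reading_scores, thinking_scores)
print(f"item correlation between reading and thinking scores: r = {r:.2f}")
```

A high positive value for an item is consistent with the guideline that good inference-making leads to good performance on that item. Whether a correlation is statistically significant depends on the sample size and requires a separate test that is not shown here.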

To apply a set of standards to the quality of inference-making in reading comprehension, certain
assumptions about the reader, the task, and the text must be made at the outset.
These presuppositions or necessary conditions are stated as follows:

I. A reader must:

   1. be competent with the difficulty level of the text;

   2. understand the demands of the task; and

   3. intend to understand the text.

II. A text must:

   1. be written coherently; and

   2. adhere to conventions of communication by being:

      a. as informative as is required for the situation;

      b. accurate or complete with adequate evidence for asserted information;

      c. relevant to the ongoing situation; and

      d. unambiguous and clear.

If these conditions are not met, poor performance on the inference task may be explained through 
failure to satisfy one, several, or all of these conditions, rather than as a lack of inference ability.

The satisfaction of Conditions I and II is necessary for the application of the following principle of
inference appraisal to judgments of readers' inference ability in reading comprehension: 

Inferences in reading comprehension tend to be good to the extent that a reader integrates 
relevant text information and relevant background knowledge to construct interpretations that
more completely and more consistently explain the meaning of the text than alternative
interpretations.

Completeness and consistency are thus the two criteria for judging interpretations. Neither criterion by
itself is sufficient; they must be used in tandem. The criteria must also be used comparatively in
situations where there are competing interpretations. We must ask which interpretation is more
complete and more consistent, because often neither interpretation will be fully complete and fully
consistent (Norris & Phillips, 1987; Phillips & Norris, 1987). Thus, the expression "tends to be good to
the extent that" is an important part of the principle. The expression is a qualifier which signifies the
limitations of the principle and emphasizes that it is not an absolute principle.
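
The comparative use of the two criteria can be sketched in code, with the caution that the principle itself specifies no numerical scoring. The Python example below is one possible operationalization and rests on several assumptions that are not part of the principle as stated: that the relevant pieces of text information and background knowledge can be listed, that completeness can be taken as the share of those pieces an interpretation explains, and that consistency can be taken as the share it does not contradict. All names and data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interpretation:
    label: str
    explained: frozenset     # relevant pieces of information the interpretation accounts for
    contradicted: frozenset  # relevant pieces of information it conflicts with


def completeness(interp, relevant):
    """Share of the relevant information that the interpretation explains."""
    if not relevant:
        raise ValueError("the set of relevant information must not be empty")
    return len(interp.explained & relevant) / len(relevant)


def consistency(interp, relevant):
    """Share of the relevant information that the interpretation does not contradict."""
    if not relevant:
        raise ValueError("the set of relevant information must not be empty")
    return 1 - len(interp.contradicted & relevant) / len(relevant)


def is_better(a, b, relevant):
    """True if a is at least as good as b on BOTH criteria and strictly better on one.

    Using the criteria in tandem means an interpretation cannot win on
    completeness alone or on consistency alone.
    """
    ca, cb = completeness(a, relevant), completeness(b, relevant)
    sa, sb = consistency(a, relevant), consistency(b, relevant)
    return ca >= cb and sa >= sb and (ca > cb or sa > sb)


# Hypothetical relevant information: three text details and one piece of background knowledge.
relevant = frozenset({"text-1", "text-2", "text-3", "background-1"})
first = Interpretation("first", frozenset({"text-1", "text-2", "background-1"}), frozenset())
second = Interpretation("second", frozenset({"text-1"}), frozenset({"text-3"}))

print(is_better(first, second, relevant))
```

Under this sketch neither of two interpretations need be fully complete or fully consistent for one to be judged better than the other, and two interpretations may also be incomparable when each is stronger on a different criterion.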

Justification of the Principle

The work of researchers in four fields provides evidence for both the derivation and justification of the
principle of inference appraisal presented in this research: critical thinking (Ennis, 1969, 1981, 1985);
philosophy and philosophy of science (Harman, 1986; Thagard, 1978, 1982); cognitive psychology
(Holland, Holyoak, Nisbett, & Thagard, 1986; McCloskey, 1983; Nisbett & Ross, 1980; Stich & Nisbett,
1980); and reading ([citation illegible]; Markman, 1981; Mason, 1984; Norris & Phillips,
1987; Phillips & Norris, 1987).

The most extensive work done on inference criteria known to me is that of Robert Ennis (1969, 1981).
He uses the expression "material inferences" and divides these into two categories: those which
generalize the evidence which is offered, and those which derive their support from their power to
explain the evidence. The latter category is most representative of the kinds of inferences made in
reading comprehension and is thus the focus of this discussion. Ennis presents criteria for judging
inferences to explanations. The inferences are justified to the extent that: (a) they explain a bulk and
variety of reliable data; (b) they are themselves explained by a satisfactory system of knowledge; (c)
they are not inconsistent with evidence; (d) their competitors are inconsistent with evidence; and (e)
they are simpler than their competitors.

The first criterion is covered in the principle of inference appraisal, where it is stated that a reader
integrates relevant text information and relevant background knowledge to construct interpretations
that are complete, that is, interpretations that explain all relevant information. Ennis' second criterion
(inferences are themselves explained by a satisfactory system of knowledge) and his third (inferences
are not inconsistent with available evidence) are incorporated into the principle of inference appraisal
where it says interpretations that more completely and more consistently explain the meaning of the
text than alternative interpretations. Competing interpretations that are inconsistent with available
evidence would be judged to be poor, given the principle of inference as it is stated, thereby
automatically incorporating Ennis' fourth criterion into the principle. Ennis' fifth criterion (inferences are
justified to the extent that they are simpler than their competitors) is embedded in that part of the
principle where it is stated that a reader integrates relevant text information and relevant background
knowledge. Irrelevant information can lead to a convoluted interpretation, rather than a
straightforward one based on relevant information. (For a detailed discussion of these ideas, see
Norris and Phillips, 1987.)

A second source of support for the principle derives from research on failures in everyday reasoning. 
According to Nisbett and Ross (1980), shortcomings in human inference-making reflect people's failure
to use normative principles and their tendency, instead, to apply simplistic inferential strategies beyond
their appropriate limits. They caution that human inference is prone to several major sources of error
including, to mention two, an over-reliance on just the information which happens to be available, and
an inappropriate weighing of data relevance. Evidence of these two errors has particular bearing for a
principle of inference appraisal in reading comprehension. In the case of the first error, readers often
place greater reliance on the text information. In the second case, readers may place too great a
reliance on some of the textual information or on their background knowledge, thereby failing to
properly integrate relevant information from both. The point is that any principle of inference
appraisal in reading comprehension must emphasize the necessity of using both relevant text
information and relevant background knowledge and of properly weighing the relevance of each.

Nisbett and Ross (1980) also present evidence that more vivid or salient information is more likely to 
enter inferential processes than is less vivid information. Salient information may influence unduly a 
person's inference-making. Other research has illustrated the tendency for ideas, once formulated or 
adopted, to persist despite evidence which might be disconfirmatory (Holland, Holyoak, Nisbett, &
Thagard, 1986; McCloskey, 1983). It seems people will point to scant positive evidence to sustain their
original interpretation, even though substantial negative evidence exists to suggest otherwise. Thus,
when some people read and are faced with counterevidence, they will tend either to ignore or to
misconstrue the evidence to advantage (Phillips, 1987). It seems that a workable principle of inference
appraisal must provide a standard against which readers can monitor whether their interpretations are
the best explanations; that is, are more consistent and more complete than alternative explanations, or
are unduly influenced by one or all of the factors mentioned above.

A third source of support for the principle of inference appraisal is garnered from work in the 
philosophy of science on inference to the best explanation. Inference to the best explanation entails 
accepting an hypothesis on the grounds that it provides a better explanation of the evidence than is 
provided by alternative hypotheses. Three important criteria are proposed by Thagard (1978) for 
determining the best explanation: consilience, simplicity, and analogy. An explanation is more 
consilient than another if it explains more of the evidence than the other by unifying and systematizing 
the information, while at the same time being informative. A simple, consilient explanation not only
explains all that is necessary, but does so without making a host of assumptions with narrow
application, merely derived for the moment. The first two of Thagard's criteria, consilience and
simplicity, thus offer support for the standards of "completeness" and "consistency" defined in the
principle of inference appraisal. Analogy also plays a vital role in good inference-making in that it 
supports the posited hypothesis by improving the explanation that the hypothesis is used to give.

Another source of support for the principle of inference appraisal rests within the reading field. Ellen
Markman (1981), in her work on comprehension monitoring, acknowledged that distinguishing a good
inference from a poor one is complex and closely tied to distinguishing better explanations or better
theories. She poses the question of how readers decide whether or not they have understood.
Markman shows how theories of comprehension inform theories of comprehension monitoring by
describing two fundamental aspects of comprehension. She argues that well organized or tightly
structured information is essential to reading comprehension, that comprehension often promotes the
making of inferences, and that the two are interrelated. I propose the following points based on
Markman's work on comprehension monitoring: (a) good inferences are highly constrained by the
context (text and background knowledge); (b) good inferences are based on warranted assumptions
and are progressive in that they subsume previous ideas from the context; (c) good inferences are
judgments confirmed by subsequent information from the context; and (d) good inferences are
judgments having elegance and parsimony within the context. The constraints imposed by context (text
and background knowledge), in the four points above, are embedded in the principle of inference
appraisal, thereby indicating that context provides both the subject matter (relevant text information
and relevant background knowledge) and the parameters (. . . to construct alternative
interpretations) of the interpretation. Warranted assumptions (point b), and inferences that have
elegance and parsimony (point d) are integrated into the principle of inference appraisal in reading
comprehension through use of the words "complete" and "consistent" as discussed earlier in this section.

A further elaboration and confirmation of the above four points is found in the work of Collins, Brown,
and Larkin (1980), where adult subjects applied at least four different tests in evaluating the plausibility
of the interpretations they constructed. The four tests include: (a) the plausibility of the assumptions
and consequences of the model (when a default assumption or a consequence of the interpretation
seems implausible, then subjects tend to reject the interpretation); (b) the completeness of the model
(interpretations are evaluated in terms of how well the assumptions and consequences of the model
answer all the different questions that arise); (c) the interconnectedness of the model (the assumptions
or consequences of an interpretation are weighted with respect to how they fit together with other
aspects of the model); and (d) the match of the model to the text (occasionally, readers seem to weigh
the model in terms of how well its assumptions or consequences match certain surface aspects of the
text). Within Collins, Brown, and Larkin's model the integration of text information and background
knowledge in the construction of interpretations is explicitly stated, as well as criteria used by adults to
test the "fit" of their interpretations.

Section Summary 

The principle of inference appraisal proposed is representative of what is currently known about 
inference and provides a framework within which to better understand the process of inference-making
in reading comprehension. The principle is intended to be neither canonical nor comprehensive, but
rather to be an advance toward a set of principles. The principle of inference appraisal must be
considered tentative and alterable in the light of both further understanding and empirical evidence.
However, as shown, it is supported by researchers in the critical thinking, philosophy and philosophy of
science, cognitive psychology, and reading fields. There is a remarkable compatibility and overlap in
the work, as can be seen by the notions of completeness, consistency, and clarity which are all
considered to be criteria of sound inferences.

Specifications of the Test of Inference Ability in Reading Comprehension 

It is difficult to separate the design and development of a test; however, since somewhat distinct 
decisions were made about each, I have opted to devote a separate section to each. This section will
provide the specifications on the design of TIA.

Test Development Framework

Audience 

The intended audience for TIA is students in the middle grades. Students in Grades 6, 7, and 8 were 
selected for both theoretical and practical reasons. 

Some of the basic tenets of reading development guided my theoretical decisions. While it is generally
agreed that reading development is continuous, it is also agreed that there are stages of development. 
By the time students have advanced to the middle grades, they have read graded materials, content 
area subjects, and have generally achieved some degree of independence in the reading process. These 
facts make it more manageable to separate out inference ability problems, should they exist, from other 
problems such as vocabulary, syntax, or other failures (Anderson & Pearson, 1984; Gentner, 1983; 
Vosniadou & Ortony, 1983). 

There is research which suggests that there are developmental differences in story comprehension
(McConaughy, 1980). It seems that even grade 5 students focus more on the literal aspects than on the
interpretive aspects. Another reason for selecting middle grade students is they have had some 
instruction in making inferences. 

Kinds of Discourse 

Discourse is typically classified as one of four types: (a) exposition, (b) narration, (c) description, and 
(d) argument (Bock & Brewer, 1985; Brewer, 1980; Spiro & Taylor, 1987). Exposition answers real or
imaginary questions. It presents facts or explains why something is important, how something works, or
what a thing means. Narration informs readers of what is happening; it is an account of events or
action and includes characters, plot, theme, and style. Description is a discourse form used to appeal to
the senses of the reader and is generally about the appearance of an object, a person, or an event.
Argument is a form of discourse in which there is an attempt to convince or persuade through appeals
to reason, emotions, or to both. Exposition, narration, description, and argument often overlap so this 
global classification omits much of the complexity of discourse. In practice, clear-cut classifications are 
not always possible. 

Three of the four kinds of discourse are more familiar to students in the middle grades; argument is
less familiar. The three full-length stories in TIA represent the common discourse forms found at the
middle grade levels, thereby providing a more thorough appraisal of students' inference ability across a
variety of reading materials than tests which assess performance on either isolated passages and
questions or on one discourse form. Narrative, expository, and descriptive texts make distinct demands
upon readers, readers' knowledge, and expectations about a task and have important consequences for
cognitive processing and learning (Anderson & Armbruster, 1984; Brewer, 1980; Spiro & Taylor,
1987). For instance, narrative text is often argued to be easier to understand than expository text for
both adults and children (Bereiter & Scardamalia, 1982; Bock & Brewer, 1985) possibly because
readers are less familiar with how expository texts are organized. Since the three discourse forms are 
an integral part of programs which students at the middle grade levels are expected to learn, then 
differences in comprehensibility between narrative, descriptive, and expository texts must be taken into 
account for diagnostic purposes. 

Topic Familiarity 

The role played by background knowledge in reading comprehension has attained such widespread 
acceptance that it no longer requires a justification. The prior or background knowledge that a person 
brings to a text is said to be one of the most important factors in understanding, remembering, and 
interpreting text information (Anderson, Spiro, & Anderson, 1978; Ausubel, 1963; Holmes, 1983; 
Johnston, 1984; Pearson, Hansen & Gordon, 1979). Furthermore, while topic familiarity or possession
of requisite domain knowledge does not necessarily guarantee interest, it does affect the readability and
comprehension of text (Phillips, 1987; Walker, 1987). Topic familiarity is seen to be necessary, but not
sufficient for comprehension.

Background knowledge alone is not sufficient for reading comprehension because a reader must know 
how to use that knowledge and want to use that knowledge. This is a particularly relevant point in the 
appraisal of inference ability because a reader must seek a complete interpretation that is consistent 
with both the text information and background knowledge in order to make good inferences. Since 
readers must integrate background knowledge with the text information to infer, to try to separate 
background knowledge from the text information is to deny the role of background knowledge in 
reading comprehension; and of course, once you read something, it becomes part of your background 
knowledge. It is not clear what readers' performance on such a test would mean. Therefore it is 
necessary to assess what prior knowledge students have in order to make accurate appraisals of their 
inference ability in reading comprehension. 

Furthermore, since it is an objective of TIA to serve as a diagnostic tool, it must be realized that 
readers do not always integrate completely text information and background knowledge. Sometimes 
readers integrate only some of the relevant text information and background knowledge; other times, 
readers will select relevant text information and background knowledge, but fail to integrate the two. 
There are occasions where readers fail to do any of the above and as a consequence fail to make an 
inference, go off course in their interpretation, or make unwarranted assumptions. 

A multiplicity of approaches was undertaken during the development of TIA to establish accurate 
estimates of middle grade readers' topic familiarity. A more thorough discussion of the procedures will 
be presented in a subsequent section, "Selection of Topics." At this point, it is sufficient to say that 
several groups were involved in determining topic familiarity and interest: (a) a group of graduate 
students were asked to list 10 topics which they felt their students were interested in; (b) 130 teachers 
attending a workshop were asked to list 10 topics that they felt their students were interested in; (c) 300 
middle grade students were asked to list 10 topics that they thought they could write about without 
difficulty; and (d) 12 middle grade classes were selected to discuss some of the topics and to write 
about them. From these information sources three topics were seen to be common areas of interest 
and within the background knowledge of the intended audience of the TIA test. 

Readability 

The readability of text is generally assumed to refer to its legibility, ease of reading, and ease of 
understanding. Many readability formulae have been developed over the years, but they have not been 
without criticism. Traditional readability formulae have been criticized for having no point of reference 
(Manzo, 1970), for neglecting the importance of the structure, texture, and informational density of text 
(Amiran & Jones, 1982), and for lacking face validity (Coupland, 1978). 

Alternative ways of estimating readability have been proposed, including the subjective text difficulty 
approach by Tamor (1981) and others (Carver, 1975-76; Singer, 1975), the psycholinguistic approach by 
Holland (1981), and the conceptual approach by Rubin (1981). Tamor combines text-based 
information (readability estimates) and performance-based information (recall scores) to come up with 
a subjective text difficulty level for individual readers. Holland's psycholinguistic alternative focuses on 
assessing the meaning-making demands placed upon readers by the language and structure of the text. 
Rubin's conceptual approach focuses on the concepts conveyed by the text: how arguments are 
presented, the role of examples in a text, and how characters' interactions are developed and described. 

I weighed and balanced the available evidence and decided not to use a readability formula in the 
traditional sense. Rather, I chose to use what may be described as a composite of both the traditional 
and alternate approaches to readability. I chose to adhere to the principles of good story writing and to 
ask students who piloted the test to assess the readability of the stories. 

The TIA stories were written on three topics identified to be familiar to middle grade students. In the 
initial pilot study, students were asked to read the stories aloud and to point out areas of difficulty. 
When the areas identified to be problematic were revised, the text was judged to be appropriate for the 
intended audience. In accord with Conditions I and II given in the previous section, it was important 
that the stories and inference questions be written coherently and adhere to conventions of good 
communication; it was also essential that a reader be competent with the difficulty level of the text, 
understand the demands of the task, and intend to understand the text. Otherwise, readers' poor 
performance on the inference task may be accounted for by a failure to meet these conditions, rather 
than by a lack of inference ability. 

Test Format 

TIA contains three full-length stories (average of 465 words per story): a narration, an exposition, and 
a description. Stories consist of four to five paragraphs and 12 scaled-answer, multiple-choice 
questions. Questions follow each paragraph in the stories. Each question has four answers provided. 
To answer the questions students are to use information given in the story and information that could 
only come from their background knowledge. Students are given an example which is thoroughly 
worked through so that they will see that they are to consider all possible answers before deciding 
which answer they think is the 'best' one. 

The challenge in changing reading assessment is to come up with new means to evaluate our current 
conceptualization of reading and to diagnose areas where instruction is needed. Reading 
comprehension admits of degrees. However, credit has generally been given on most tests of reading 
comprehension for one and only one correct answer. There has been no allowance for partially correct 
responses, that is, for evidence that a student may be capable of selecting relevant information without 
quite knowing what to do with it. 

The challenge in the design of TIA was to provide diagnostic information about students' performance 
and to use that information to support instructional improvement. To achieve this end, TIA represents 
a creative melding of the old and new. The old format of selecting an answer is there with the new 
advantage of giving credit for answers that are not completely correct. TIA may be described as a 
"scaled-answer multiple-choice" test, since it attempts to account for variations in understanding. The 
four answer choices represent a range in values (0-3) assigned according to the quality of the answer 
selected. An answer that is consistent with both the text information and background knowledge is 
worth 3 points; a partially-correct answer is worth 2 points; a text-based answer is worth 1 point; and an 
erroneous answer is worth 0 points. (A complete copy of the scoring guide is provided in Appendix A.) 
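
The scaled-answer scoring just described can be illustrated with a minimal sketch. The category names, the `score_test` function, the two-item key, and the choice to give zero credit for an omitted answer are all assumptions made for illustration; the actual scoring guide is the one in Appendix A.

```python
# Point values follow the 0-3 scale described in the text.
# The category labels themselves are hypothetical names.
POINTS = {
    "complete": 3,    # consistent with text information and background knowledge
    "partial": 2,     # partially-correct answer
    "text_based": 1,  # text-based answer
    "erroneous": 0,   # erroneous answer
}


def score_test(responses, key):
    """Sum scaled-answer points over all items.

    responses: list of chosen letters ("A".."D"), or None for an omitted item.
    key: list of dicts, one per item, mapping letter -> answer category.
    """
    total = 0
    for choice, item_key in zip(responses, key):
        # Assumption: an omitted or unrecognized choice earns no credit.
        total += POINTS.get(item_key.get(choice), 0)
    return total


# Hypothetical two-item key; a real key would cover all 36 items.
key = [
    {"A": "partial", "B": "complete", "C": "erroneous", "D": "text_based"},
    {"A": "text_based", "B": "erroneous", "C": "complete", "D": "partial"},
]
print(score_test(["B", "D"], key))  # complete (3) + partial (2)
```

Under this scheme a 36-item test has a maximum scaled score of 108, so a total score carries more information than a count of fully correct answers.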

Test Length 

The current version of TIA may be administered in a class period (50 minutes). This allows time for 
giving instructions and the actual test-taking time. It is intended to be a power test, so students are 
given a reasonable amount of time to complete all items. From teacher reports, it appears that the 
average test-taking time is 30 minutes, so this may be used as a rough guide if teachers wish to use it 
as a speed test. 

Selection of Topics 

Differences in background knowledge among students and students at different grade levels can be 
manifested in different ways. These differences may lead to variance in performance on reading 
comprehension tests and hence to invalid interpretations of students' performance. It is desirable that 
the world views, or empirical beliefs needed to interpret a story on which test items are based, be ones 
that most students share (Norris, 1988). If scores on TIA are to be taken as measures of inference 
ability in reading comprehension, then it is necessary to reduce as much as possible the effects of 
background knowledge. To minimize differences in performance which might be due to differences in 
background knowledge, rather than to differences in inference ability, items were selected on the basis 
of their familiarity to students in Grades 6, 7, and 8. 

Sensitivity to the issue of background knowledge led to a comprehensive study of topics for potential 
selection for item development. The six stages of the study are described next. 

Stage one: graduate students. The first stage involved eight graduate students with a diversity of 
teaching experiences. Each student was asked to list 10 topics which he or she felt students in Grades 
6, 7, and 8 would be interested in and have knowledge of, and to provide a justification for their 
choices. These are the 10 topics which the graduates identified most frequently: travel, space, videos, 
sports, animals, money, friends, future, styles, and science fiction. 

Stage two: teachers. In the second stage, 130 middle grade teachers attending a professional 
development workshop were asked to list 10 topics which would interest their students and to justify 
their list. The teachers identified the following topics most frequently: cars, space, money, science 
fiction, holidays, music, mystery, computers, sports, and hobbies. The graduate students' and teachers' 
lists overlapped on the following topics: space, sports, money, future, and science fiction. 

Stage three: students (topic identification). In the third stage, the ideas of middle grade students 
were sought. Three hundred students at Grades 5, 6, 7, 8, and 9 were asked to list 10 topics they 
thought they could write about without difficulty. Grades 5 and 9 were taken in addition to Grades 6, 7, 
and 8 to account for potential differences at the upper and lower limits of reading ability in the 
intended test group. The most preferred choices of the students are the following: money, space 
things, sports, pets, getting out of school, holidays, movies, space friends, war, and travel. Overlap in 
topic choices among all three groups indicates that the most popular choices are money, space or 
space-related topics, sports, getting out of school, holidays, and pets. 

Stage four: students (unassigned written essays). In the fourth stage, middle grade students were 
asked to write on a topic of their choice. This was done to distinguish topics which students would 
choose to write about from those that might sound exciting but about which they would be unlikely or 
unable to write. Twelve classes of students in Grades 6, 7, and 8 were asked to choose from the most 
common topics identified up to this point (money, space or space-related things, sports, pets, getting 
out of school, holidays) or any other topic and to write an essay. 

The essays were generally about space, money, and pets in one way or another. Specific differences 
existed in the general topic; for instance, essays about pets varied from the time it takes to care for 
them to how pets are wonderful friends. Bearing in mind that each story on TIA was to be 
representative of the reading materials at the middle grade levels, from the most popular student 
topics three topics were selected: UFOs, Money, and a Newspaper Mystery. 

Stage five: students (assigned written essays). In the fifth stage, 65 students in Grades 5 through 9 
were asked to write a story about UFOs, Money, and a Newspaper Mystery. These essays were studied 
for vocabulary choice, sentence and idea complexity, and form. 

Stage six: final topic selection. The sixth and final stage of topic selection went through three phases 
involving free recall and word associations, recognition, structured and unstructured questions, and 
discussions on each of the three topics. These phases represent a synthesis of research on assessing 
background knowledge (Adams & Bruce, 1980; Anderson, Spiro, & Anderson, 1978; Holmes, 1983, 
1987; Holmes & Roser, 1987; Pearson, Hansen, & Gordon, 1979; Spilich, Vesonder, Chiesi, & Voss, 
1979; Walker & Yekovich, 1984) and are taken to be some of the best available ways to assess 
background knowledge. It took approximately five class periods to establish students' background 
knowledge of each topic. 

In Phase 1, students were asked to free recall or to brainstorm on each of the topics. They were 
directed to think of all information they would expect to find in a story about UFOs, a story about 
Money, and a Newspaper Mystery. Also, students were asked to come up with associations for 
UFO-related words like heavenly bodies, evidence, and explanations; for Money-related words like uses, 
forms, characteristics, and changes; and for Newspaper-related words like responsibilities, carrier, 
weather, and newspaper-related confusions. 

The second phase involved recognition activities to identify any misunderstanding which students might 
have about each of the three topics. These activities were developed from the Phase 1 discussions. For 
example, it became evident that some students thought that scientists know what UFOs are and that 
UFOs are meteors and "stuff like that" in the sky. Students were asked to identify from a prepared 
sheet dealing with these matters several possible correct and incorrect answers to questions such as 
"What are UFOs?", "Is a meteor a UFO?", "Would there be UFOs if scientists know what they are?" 
The answers provided to these kinds of questions led into the final phase of topic selection. 

Discussions guided by structured and unstructured questions completed the final phase of establishing 
topic familiarity. A structured question on the Money topic, for instance, was "What is money?" Such 
questions led to unstructured questions about the topic such as "Do all jungle tribes have money?" and 
lively discussions were held with the students on each of the three selected topics. 

For the purposes of this project, students were assumed to have a sufficient amount of topic familiarity 
if they were able to speak to each topic according to a general outline as follows: 



UFO Outline 

I. What UFOs are believed to be 
II. What UFOs are reported to look like 
III. Where UFOs are reported to come from 
IV. Available evidence 
V. Why UFOs are studied 

Money Outline 

I. What money is 
II. The characteristics of money 
III. How money developed 
IV. Functions of money 
V. Why the form of money changes 

Newspaper Delivery 

I. Carrier's responsibilities 
II. Knowing the route 
III. The importance of time 
IV. How to deal with people 
V. Potential problems 



Furthermore, students were deemed to have sufficient relevant background knowledge if at least 70% 
of them were able to speak to these outlines. Seventy percent was taken as a satisfactory lower bound 
of general knowledge on each topic because, assuming randomness, this would mean there was less 
than a 3% chance that a student would have knowledge of none of the topics and less than a 10% chance 
that a student would have knowledge of fewer than two topics. In the case of the TIA topics, the 
specific student levels were 80%, 75%, and 90% on the UFO, Money, and Newspapers topics, 
respectively. Thus, the chance of systematic bias against any student across all topics is minimal. 
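
The arithmetic behind these figures can be checked with a short sketch. It assumes that "assuming randomness" means knowledge of each topic is statistically independent across the three topics, which is an interpretation rather than something the report states; the function name is hypothetical.

```python
from itertools import product


def topic_count_distribution(levels):
    """Return [P(knows exactly 0 topics), P(exactly 1), ...].

    levels: per-topic proportions of students with sufficient knowledge.
    Assumption: knowledge of each topic is independent of the others.
    """
    dist = [0.0] * (len(levels) + 1)
    for outcome in product([0, 1], repeat=len(levels)):
        p = 1.0
        for knows, level in zip(outcome, levels):
            p *= level if knows else 1.0 - level
        dist[sum(outcome)] += p
    return dist


floor = topic_count_distribution([0.70, 0.70, 0.70])     # the 70% lower bound
observed = topic_count_distribution([0.80, 0.75, 0.90])  # UFO, Money, Newspapers

print(round(floor[0], 3))                   # no topics, at the 70% bound
print(round(floor[0] + floor[1], 3))        # fewer than two topics, at the 70% bound
print(round(observed[0] + observed[1], 3))  # fewer than two topics, observed levels
```

Under this independence reading, the chance of knowing none of the topics at the 70% bound is 0.3 cubed, or about 2.7%, which agrees with the stated figure of less than 3%. The chance of knowing fewer than two topics works out to about 21.6% at a uniform 70% and about 8.5% at the observed levels, so the figure of less than 10% appears to correspond to the observed levels rather than to the 70% bound itself. A different reading of "assuming randomness" could give different numbers.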

The comprehensive information gathered from the topics identified, the students' written stories, and 
class visits guided the choice of topics for the stories in the TIA test. Three stories were written for the 
TIA test: UFOs, Money, and The Wrong Newspapers. The UFOs story was modified from previous 
research projects (Beebe & Phillips, 1980; Phillips, 1985) and continues to be a winner among students. 
It is a story about unusual phenomena, telling of different UFO reports, offering plausible explanations 
for some of the reports, and suggesting that with improved technology we may be able to explain 
UFOs. The Money story is a description of the everyday use of money, of how it works, as well as its 
historical development. The third and final story is a mystery entitled "The Wrong Newspapers" which 
involves a mix-up in newspaper delivery, with the culprit being the neighbor's dog. 

Principles of Story Writing 

Story grammars have been developed to illustrate underlying text structures. The most common types 
of text used in the middle grade levels are narrative, descriptive, and expository. Each is organized in a 
particular way and it is believed that children use the structure, once they have it internalized, to assist 
them in understanding and recalling information (Stein, 1983; Thorndyke, 1977; van Dijk & Kintsch, 
1983). 

There is overlap in the classifications of narrative, expository, and descriptive structures since all three 
may be found in the one story. The Wrong Newspapers story fits more within the narrative 
classification than either the expository or descriptive. However, the UFOs and Money stories are 
harder to classify because they overlap considerably with the expository and descriptive forms. 

The principles of story grammar were followed in writing the TIA mystery narrative entitled "The 
Wrong Newspapers." The principles may be summarized as follows: There should be a setting which 
introduces the characters and provides the time and place of the story; an initiating event should occur 
which sets the story in action; there should be a response to that action followed by an attempt to 
achieve a goal or to respond to an action; the consequences of that attempt woven with a reaction are 
provided. These principles, coupled with a sensitivity to vocabulary choice, sentence structure, and 
sentence length, were in our minds during the story writing process. The data obtained when students 
read the stories was used to make final judgments on topic and story suitability. 

Narrative provides a resolution or stopping point and therefore it is easier to identify its underlying 
structure or grammar than is commonly the case in expository material. Expository material has at 
least six underlying structures: serial, topic, restriction and illustration, definition, argumentation, and 
comparison-contrast. The serial pattern may be considered the generic basic structure since the others 
are more general [...]. These five structures (topic, restriction and illustration, 
definition, argumentation, and comparison-contrast) are more perplexing than the serial structure from 
another perspective because within a serial pattern there may be occasions when other structures are 
used. Consider the case in a social studies text where forms of travel are being studied in 
predominantly a serial pattern, but for a couple of paragraphs modes of travel are compared and 
contrasted, followed at the end of the chapter by a generalization about the most efficient means of 
travel. So, it is common to see much overlap and intermingling among text structures. 

The TIA stories on UFOs and Money were written following primarily a serial pattern: A general 
concept is presented in each story; generalizations combined with examples are stated; a sequence of 
events unfolds; a conclusion follows. The authors were cautious to ensure that vocabulary choices were 
either known or explained and that sentences were coherent. The UFOs, Money, and The Wrong 
Newspapers stories were further subjected to the Anderson and Armbruster (1984) test of 
understandability: do the stories provide enough relevant information to achieve the author's purpose 
and to be meaningful to their readers? The evidence from the groups of students in Grades 6, 7, and 8 
who read and discussed the stories is that the test of understandability was passed. 

The Evolution of the Test of Inference Ability 
in Reading Comprehension 

The present form of TIA represents the results of six phases of evolution in design and development. 
This section will provide details of the modifications at each phase of the test development as well as a 
rationale for them. The test validation techniques that guided the design and 
development of TIA will also be described. 

Initial Test Version 

The Initial Test Version contained three stories (UFOs, Money, and The Wrong Newspapers) and 48 
short-answer questions. It was given to a graduate reading class and a colleague who was engaged in 
the development of a test of inductive reasoning in critical thinking. 

On the basis of feedback from these two sources the test was edited and written more concisely. The 
number of test questions was reduced from 48 to 36 because the test was too lengthy even for these 
subjects. This task was made simple for two reasons: (a) 5 of the 12 questions were judged to be 
insufficiently related to the stories to allow for more complete and consistent inferences to be made; 
and (b) 7 of the 12 questions could not be easily identified as inference questions, so in cases of doubt 
the questions were dropped from the test. 

Pilot Study 1 (Short-Answer Version) 

Trial One 

Following the completion of the revisions to the Initial Test Version, the first pilot was conducted. A 
short-answer format, rather than a multiple-choice format, was used to help understand what the 
questions measured. In addition, this trial administration was done to check on test length, passage 
difficulty, vocabulary choices, clarity of instructions, and question ambiguities. The effects of story 
order were studied. As reported in a previous section, research shows that narrative text is generally 
easier to understand than expository text. If this is so, then differences in performance could be 
expected if the order of presentation of discourse types was altered. If differences were identified, then 
story order would be an important consideration in subsequent test development. 

Sample and procedure. Sixty-five students in Grades 6, 7, and 8 participated. Test booklets were 
distributed randomly to students with the three stories (UFOs, Money, and The Wrong Newspapers) 
collated in the six story combinations. The directions and sample paragraph and inference question 
were discussed with the students. Students were told that they would have to use their background 
knowledge and the text information to answer the questions, that they would read three stories, and 
that each story would have four or five paragraphs and questions on each paragraph. Students were 
directed to read each paragraph, to write their answer to each of the corresponding questions, and to 
justify their answers. When all student enquiries were answered, the test was started. 

Results. The pattern of student responses to the inference questions was one of the most significant 
findings. Students' answers were of four types: an implausible response; a non-inference response; a 
partially-correct inference response; or a complete inference response. (A more detailed description of 
these may be found in Appendix A.) 

It was found that the test was too long, since it took an average of one and one-quarter hours to 
complete, not counting the time necessary to give and complete the sample item. 

No differences in performance on the basis of story order were found. Students may have acquired 
new knowledge while taking the test; however, it did not add to nor detract from their performance 
when story order was altered. 

Test revisions. From an examination of students' answers numerous revisions were made: (a) six 
questions were reworded to make their meaning clearer and one question was deleted because of 
ambiguity; (b) three questions in the UFOs story (#8, 9, 10) were re-sequenced as #10, 8, 9 to match 
the text sequence; (c) sentences in the text judged to be too similarly worded to the corresponding 
inference questions were deleted; (d) some sentences were modified to be more general, and less 
explicit, thereby making the corresponding inference questions more challenging; and (e) other 
sentences were changed to clarify meaning. 

Upon completion of the revisions based on the results of Trial 1, the number of questions on TIA for 
the Trial 2 study was 36: 12 for each story. 

Trial 2 

Trial 2 of Pilot Study 1 (short-answer version of TIA) was completed to determine whether the four 
types of responses identified in Trial 1 would be replicated with more refined passages and questions. 
If they were corroborated, then subsequent test development would have to take these response 
variations into account, if test performance was to be taken as a valid indication of student ability. 

Sample and procedure. One hundred students in Grades 6, 7, and 8 took the short-answer version of 
the TIA test. The same procedure was followed as in the previous trial. 

Results. Students' written responses and accompanying justifications for their answers were studied. 
The trend of reader response variations identified in Trial 1 was evident in the responses on this trial. 
Student responses for each question fell within one of the four response patterns identified in Trial 1. The four 
variations (an implausible response; a non-inference response; a partially-correct inference response; 
or a complete inference response) in student performance became a major factor in the future design 
and development of TIA. 

Test revisions. Since a multiple-choice format for TIA was the ultimate aim, the fourth version of TIA 
involved writing a scaled-answer multiple-choice set for use in the second pilot study. The item forms 
on the short-answer and multiple-choice versions of the test were modified so that they would be 
identical. 

It was presumed that a sentence stem format, where possible, might help to reduce writing time, 90 
minutes on the short-answer version, to one class period (50 minutes). Distractors for the 
multiple-choice version of the test were derived from students' answers on the written short-answer versions 
from Pilot Study 1. Each set of four possible answers was scaled as follows: an implausible response 
worth 0, a non-inference response worth 1, a partially-correct inference worth 2, and a complete 
inference worth 3. This "scaled-answer format" was used to afford students the option of selecting the 
type of response they would likely make if they were taking the short-answer version of TIA. 
Multiple-choice items were constructed such that distractors were consistent in grammatical style, vocabulary, 
and length. Answers were randomly assigned to serial position (A, B, C, or D). 
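
The construction of a scaled-answer item, with answers randomly assigned to serial position, might be sketched as follows. The function name, the response-type labels, the placeholder stem, and the placeholder answer texts are all hypothetical; they are not items from TIA, and the report does not say how the random assignment was actually carried out.

```python
import random

# Scale values as described in the text; the label names are hypothetical.
SCALE = {
    "implausible": 0,
    "non_inference": 1,
    "partial_inference": 2,
    "complete_inference": 3,
}


def build_item(stem, answers, rng):
    """Assign the four scaled answers to positions A-D at random.

    answers: dict mapping response type -> answer text.
    rng: a random.Random instance, passed in so results can be reproduced.
    Returns the stem, the lettered options, and a letter -> points key.
    """
    types = list(answers)
    rng.shuffle(types)  # random serial position for each answer
    options, key = {}, {}
    for letter, response_type in zip("ABCD", types):
        options[letter] = answers[response_type]
        key[letter] = SCALE[response_type]
    return {"stem": stem, "options": options, "key": key}


# Placeholder content only; not a real TIA item.
item = build_item(
    "Placeholder sentence stem",
    {
        "implausible": "(implausible answer text)",
        "non_inference": "(non-inference answer text)",
        "partial_inference": "(partially-correct answer text)",
        "complete_inference": "(complete inference answer text)",
    },
    random.Random(0),
)
print(item["key"])
```

Keeping the letter-to-points key alongside the shuffled options means the scoring step does not need to know which position the complete inference landed in.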

Pilot Study 2 (Short-Answer/Scaled-Answer Multiple-Choice Versions) 

The second pilot study was conducted to serve four purposes: (a) to examine the degree of similarity 
of performance on the short-answer and the scaled-answer multiple-choice formats; (b) to compare 
completion times required by both test formats; (c) to corroborate whether the four patterns of 
responses identified in Pilot Study 1 would be displayed by the students in this pilot; and (d) to identify 
potential item ambiguities, vocabulary difficulties, and other problems. 

Sample and Procedure 

Eighty-one students in Grades 6, 7, and 8 participated in Pilot 2. Forty students wrote the short-answer 
version and the remainder wrote the multiple-choice version. The same procedure described in 
previous pilots was followed, with one exception. The students taking the multiple-choice version were 
provided with answer choices and cautioned to consider all possible answers before deciding the 
answer they thought best. 

Results of Multiple-Choice Format 

Test completion time ranged from 50-75 minutes on the multiple-choice format for classes in Grades 6, 
7, and 8. This represented an average reduction of 10 minutes compared to the short-answer format; 
unfortunately, we expected a greater reduction. 

Item analysis, on the basis of correct or incorrect answers, of the multiple-choice format showed a 
KR-20 reliability of 0.68 and a test mean of 17.5 items correct out of a total possible 36 items, with a 
standard deviation of 4.73. The test means for Grades 6, 7, and 8 were 14.0, 17.1, and 19.6 respectively. 
Item/test biserial correlations and item difficulty indexes were computed and are presented in Table 1. 
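
For readers unfamiliar with the statistics named above, the following minimal sketch shows how a KR-20 reliability and item difficulty indexes can be computed from right/wrong item scores. The function names and the toy data are illustrative assumptions, not the pilot data; the sketch uses the population variance of total scores, and the report does not state which variance convention was used.

```python
def item_difficulties(scores):
    """Proportion of students answering each item correctly.

    scores: list of per-student lists of 0/1 item scores.
    """
    n = len(scores)
    k = len(scores[0])
    return [sum(row[i] for row in scores) / n for i in range(k)]


def kr20(scores):
    """Kuder-Richardson formula 20 for dichotomously scored items.

    KR-20 = (k / (k - 1)) * (1 - sum(p_i * q_i) / variance of total scores)
    Assumption: population variance (divide by n) of the total scores.
    """
    n = len(scores)
    k = len(scores[0])
    totals = [sum(row) for row in scores]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    pq = sum(p * (1 - p) for p in item_difficulties(scores))
    return (k / (k - 1)) * (1 - pq / var_total)


# Toy data: 4 hypothetical students by 3 hypothetical items.
toy = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(item_difficulties(toy))
print(kr20(toy))
```

A difficulty index near 1 marks an item that nearly everyone answered correctly, and a KR-20 closer to 1 indicates more internally consistent items.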

[Insert Table 1 about here.] 

It can be seen from Table 1 that 3 of the 36 questions had negative biserial correlations (questions 18, 
20, and 35). Examination of these three items, coupled with students' short-answer responses, revealed 
the answer sets for questions 18 and 35 to be ambiguous. Question 20 required students to consider a 
historical perspective, but it appears that most students answered it from a current events perspective. 

Questions 8 and 28 had very low biserial correlations. It was clear, upon examination, that the 
problems with questions 8 and 28 were vocabulary-related. It seems that many students did not note 
the relevance of particular examples which were cited. For instance, "meteors" were cited as an 
example of "astronomical events," but students did not see the relevance in answering item 8 which read 
"Other kinds of astronomical events that people mistake to be UFOs are." It seems students did not 
understand "astronomical events," so it was replaced with "heavenly bodies." Revisions were made to 
all aspects of the test identified to be either definitely or potentially problematic. 

The item difficulty levels also pointed to problems with questions 18 and 35 discussed in the preceding 
paragraph. Question 13 was among the more difficult items; it seems that a word in the question stem 
was interpreted differently by many students from the test authors. The question read "Money is 
needed in at least two different ways;" students interpreted the question by focussing on the word 
"needed" as necessities. The test authors intended the item to get at the idea of commonality of money. 
Clearly, the problem with the item was with the wording and not with how the students interpreted it. 
Item 13 was revised to read "Money is a familiar part of our lives because." 

Results of Short-Answer Format 

Students' responses on the short-answer format were examined and the type of answer identified 
(implausible response, non-inference response, partially-correct inference, complete inference). Again, 
the pattern of student responses was consistent with the two previous trials under Pilot 1. This result 
was taken as evidence that the category scheme could be helpful in constructing distractors that 
students performing at each level would find plausible. 

Student responses on the short-answer format were compared to the multiple-choice key to assess the 
agreement between the number of responses per item that received full credit. The results are 
presented in Table 2. It should be pointed out that for purposes of this analysis an item on the 
short-answer format was not considered correct unless it expressed the same meaning as the answer keyed as 
correct on the multiple-choice format; consequently the percentages of agreement between the two are 
necessarily lowered. For instance, consider item 30 which says "Ann wanted to hand deliver Mr. 
Jones's newspaper because." The response keyed correct on the multiple-choice format and the one 
required on the written short-answer format would be "to make sure he got it and to talk to him about 
the mystery." So, unless students responded on the short-answer version with a compound answer, they 
were not scored as completely correct, even though they may have been partially correct. A lower 
percentage of agreement was found on those items that required students to synthesize story 
information. It seemed that if questions required students to pull together more than one piece of 
information to formulate a complete response, then they experienced difficulties or did not consider all 
available relevant information. Question 26, for instance, required the synthesis of three pieces of 
information; however, the most common response written on the short-answer test and selected on the 
multiple-choice test was a partially-correct one. 

[Insert Table 2 about here.] 

The mean on the 36-item short-answer test was 12.97. A recognized restriction of this pilot was that 
one, and only one, answer was deemed acceptable, which undoubtedly ignores a range of answers 
which may have been partially correct. Bearing in mind this restriction on acceptable answers, it 
seems reasonable to expect that the level of overall performance may have been reduced. The mean 
on the multiple-choice test was 17.15, which reflects a significantly higher level of performance. 
Another explanation for the lower performance on the short-answer test could be related to the fact 
that students had to construct and write an answer, which would seem to be a more demanding task 
than selecting an answer on the multiple-choice test. Student performance on the multiple-choice test 
may be a "better" indication of their reading ability than the short-answer test, where performance is 
confounded with students' ability to express their ideas in writing. Also contributing to lower scores 
was the fact that students tended to leave more items unanswered on the short-answer test than on the 
multiple-choice test. 

Test Revisions 

Passage, question, and answer modifications were made to TIA prior to the next pilot. Revisions were 
made to each of the three stories. For instance, on the UFOs story, it was found that students failed to 
attend to the word "not" in the sentence, "Many of the older reports are not complete so we need to 
continue to study UFOs," consequently leading to an erroneous response. The sentence was modified 
to "Many of the older reports are incomplete so we need to continue to study UFOs." 

Some questions were replaced because they did not require students to make inferences, and some 
answers to other questions were replaced because of ambiguity. Other revisions included substitutions 
in word choices and changes in information placement. Further revisions included making answer 
selections more parallel with one another. For instance, "plastic cards" was replaced with "club cards" 
in item 23 to make it more parallel with the other options: "trade items," "credit cards," "chocolate 
bars." 

Pilot Study 3 (Verbal Reports as Data) 

Pilot Study 3 was conducted using a verbal report methodology. The use of verbal reports in test 
construction has been suggested by several testing experts (Anastasi, 1988; Cronbach, 1971; Haney & 
Scott, 1987; Messick, in press). Verbal reports were used as a method to validate whether a complete 
inference had been made when students selected the answer keyed as correct, to help ensure that 
multiple-choice test questions were functioning effectively as inference questions. In addition, such an 
approach is particularly useful in test development for revealing potential item ambiguities, vocabulary 
problems, and hidden cues. 

Care was taken to develop interview procedures which would not jeopardize the quality of information 
to be collected and conclusions to be drawn. Two trial verbal report sessions with six students each 
were held to ensure that the two interviewers understood the demands and limits of the approach, as 
well as to determine whether the information needed from the students was being acquired. 

Sample and Procedure 

Thirty-six students in Grades 6, 7, and 8 participated. Students were each assigned to 1 of the 3 stories 
on TIA. They were told that to answer each question they would have to use information given in the 
story and information they already knew. They were told that the story would not directly answer the 
questions and they would have to use their common sense along with the story information. Students 
were advised to consider all possible answers before deciding which answer they thought was the best 
one. Each student worked through a sample item with an interviewer. Once the sample item was 
completed and students' questions were answered, students were asked to read aloud each paragraph, 
to read the corresponding test questions, to select 1 of the 4 answers provided, and to tell why they 
thought that answer was the best. 

Interviewers questioned students only if there was a lack of clarity in a response, such as an unspecified 
pronoun antecedent or an answer so terse or vague that it was too incomplete to follow. At the end of 
the test interview, general questions were asked about students' interest in the story, about whether the 
passage vocabulary gave them any problems, and about whether there were other things that were 
unclear to them. Each verbal report protocol was transcribed and a scheme developed to code the 
quality of students' responses. 

Scoring Responses 

In order to reflect the range of responses shown by students in the verbal reports, a scoring system was 
devised to allow credit for partially correct, as well as complete, inferences. Scores from 0 to 3 were 
assigned on the basis of the range of completeness of the student responses. See Appendix A for the 
criteria for grading the test of inference ability. 

The following question (Q1) and its possible answers (A, B, C, or D) illustrate the scoring system. 

Q1 UFOs are sometimes called other names because 

(A) people name them according to their shape or probable origin. 

This answer is a complete inference and, therefore, is given a score of (3). The relevant textual 
information was contained in sentence three, "People sometimes call UFOs flying saucers, spaceships 
from other planets, and extraterrestrial spacecraft." Using background knowledge it can be concluded 
that the naming criteria for UFOs in this story are based on either shape ("saucers") or probable origin 
("other planets" and "extraterrestrial"). 

The integration of the relevant textual information and background knowledge makes (A) the best 
inference response for question 1. 

(D) people don't know what to call them so name them by shape. 

This answer is given a score of (2). It is a partially correct inference for question 1 because it only 
considers one of the naming criteria, shape, when the textual information supplies two criteria. The 
criterion of shape was selected for this alternative instead of the criterion of origin because shape was 
focused upon in all instances of explanations by students in the verbal reports. 

(C) people see an area with many colored lights in the sky 

This answer is given a score of (1). It is based on textual information from sentence 5. However, the 
relevant textual information is contained in another area of the text (sentence 3). Although the textual 
information selected deals with the appearance of UFOs, it is not the most relevant part of the text. 

(B) people know they are unidentified flying objects in the sky 

This answer is scored as (0). It is the least correct answer because it makes no sense either in relation 
to the text, or in relation to background knowledge. People do not know for certain that what they see 
are unidentified flying objects, and this is not the reason given in the text that synonyms exist for UFOs. 

Answer Set Revisions 

The process of revising answer sets based on students' verbal reports had two complementary phases. 
One phase dealt with editing existing answer sets and the other with developing new answer sets which 
would reflect the range of answers students gave in their verbal reports. 

Answer sets were revised where students' explanations of their choice of answer showed either that 
students made an inference but still selected a less than best answer, or that they used inadvertently 
placed cues in the answer set to select the best answer. The second phase is discussed in the next 
section with question revisions. 

Vocabulary and Question Revisions 

A number of terms which students did not understand became apparent in the verbal report data. 
Samples of vocabulary revision include the following substitutions: "scientific equipment" for 
"technology," and "heavenly bodies" for "astronomical events." Care was taken to maintain the intent of 
the text and use of precise terms while substituting appropriate vocabulary for students at the Grades 6, 
7, and 8 levels. For example, in looking for a substitution for "astronomical events," children's science 
texts, children's science encyclopedias, and science reference books were consulted. 

Eleven questions were deleted from this test version. Five of the 11 questions were judged to be based 
too heavily on students' background knowledge. The 5 questions did not meet the principle that a good 
inference question is one that requires a reader to integrate relevant text information and background 
knowledge to construct complete interpretations that are consistent with both the text information and 
background knowledge. Four of the 5 questions required students to give answers based on word 
knowledge. For instance, one of the items stated "The word independence in this story means," which 
could have been answered without reading the text. In another item, students' lack of background 
knowledge hampered them in making a complete inference, so the question was deleted. The 
deleted item read "Money might be more risky to use than credit cards because," but according to 
students' verbal reports, they did not know that credit cards could be cancelled and were therefore less 
risky to lose than money. 

The remaining six questions were deleted for a variety of reasons. Difficulty level indices from 
previous pilots indicated that question 13 was one of the most difficult questions (see Table 1). Student 
verbal reports indicated differences in word interpretations from those intended by the authors. Item 
13 said "Money is reused by;" it seems students interpreted "reused," though illogical, to refer to the 
same money being used over and over or saved by a single individual and not the circulation of money. 

Questions requiring students to make time frame shifts were problematic, as evidenced in students' 
verbal reports and in the item analysis results. For example, students' verbal reports showed that they 
responded to item 20, which read "Years ago, cows, coffee, and shells did not keep their value as well as 
money today because," from a current events perspective. A typical response was "they are not wanted 
by everyone, whereas money might be because if you traded with people from the city they might not 
need cows." Item 20 had gone through three revisions and yet students seemed to focus on the current 
rather than the past, so the item was deleted. The remaining two questions did not function well as 
inference questions because they were judged to be too text-dependent, so they were deleted. 

Students' verbal reports also pointed to item ambiguities (items 9, 18, and 35). For example, on item 9, 
students interpreted "find out" to mean discover new facts, when the intended meaning was "learn." So 
the item stem "Using the reported information we would find out the most about UFOs by" was 
changed to "Using available information people learn the most about UFOs by." This modification 
required students to make the inference that the "available information" was the reports described in 
the story. 

Story Passage Revisions 

The final section of test revisions in this pilot deals with the story passages. The major change was with 
the "Money" story. Because the first five inference questions in the "Money" story were 
deleted, the first two story paragraphs were also deleted. Two new paragraphs and five new inference 
questions on the functions and characteristics of money were written for the "Money" story. 

Minor changes were made to other paragraphs through deletion and addition of sentences. Sentences 
were added to story paragraphs in instances where more textual information was required for a specific 
inference question or where a new test question had been added. For example, the sentence "Weather 
conditions are checked when scientists study available information about UFOs" was added to the 
second UFOs paragraph to complement the question "Weather conditions affect UFO sightings in the 
sky because" (UFOs Q5). The sentence in the third paragraph of the "Money" story, "Large animals 
made trade difficult because there was too much price difference in items," was deleted. There was 
insufficient story information about trade items for students to answer the corresponding inference 
questions. Changes of specific vocabulary in story paragraphs were discussed under editing of test 
vocabulary. Remaining changes were cosmetic in nature. 

Pilot Study 4 

The revised scaled-answer multiple-choice test was given to two fourth-year college classes in the 
Faculty of Education at Memorial University of Newfoundland. Sixty-one students participated in this 
pilot. The purpose was twofold: first, to have an expert adult sample confirm the researchers' rating 
decisions for young students' responses on the TIA test; and second, to have the experts take the test in 
order to pinpoint any remaining ambiguities or other problems which might necessitate revision. 

So-called experts may be quite unreliable judges of items written for younger students because the 
adult conception of what is and is not familiar may be quite different from that of younger students. 
For example, recall that the item "Money might be more risky to use than credit cards" required students to 
know that credit cards may be cancelled. A study of students' verbal reports revealed this to be a piece 
of information which they did not know. Consequently, while adults consistently made a complete 
inference on this item, the middle grade students never did. The item was dropped because it did not 
measure students' inference ability. 

For 85% of the items the experts rated the young readers' responses consistent with the rating assigned 
by the researchers. The remaining 15% were taken to need further revisions. In addition, comments 
and queries made by the experts were studied and appropriate changes made. 

It was assumed that college students would get all 36 items correct on a test intended for middle grade 
students, so only correct answers were scored. However, the mean number correct for the two college 
classes was 23.95, out of 36 items, with a standard deviation of 3.68 and a KR-20 reliability of 0J8. 
There seemed to be distinct divisions among the expert sample about some of the items. For instance, 
there were adults who wrote "there is no best answer here" on an item that required them to synthesize 
two or more pieces of information. It seemed some of the experts would indicate the right information 
needed for an answer, but would not pull the information together to make a complete and consistent 
inference. The remaining experts seemed to have little difficulty making complete inferences 
consistent with those of the researchers. Thus, the majority of the experts were taken to be reliable 
judges of the best answers. 

Pilot Study 5 

Test Validation 

The fifth pilot study was designed to study the relationship between students' answer selections and 
their thinking processes in making those selections. One purpose was to determine the quality of 
students' thinking when they selected their answer for each test item. Understanding students' thinking 
processes is of fundamental importance because students often arrive at good answers without thinking 
well and at less than good answers even though they may have thought well. A second purpose was to 
find out whether the verbal report process affected students' performance, either positively or 
negatively. 

Specifically, four issues motivated the validation procedure: (a) to determine whether students 
understood the task, that is, that they were to use information from the text and from their background 
knowledge to answer the inference questions; (b) to determine whether students understood each test 
item and reasoned well when they picked the best answers; (c) to determine whether students who 
chose an incorrect answer to an item did so because they did not reason well; and (d) to determine 
whether verbal reporting affects performance in comparison to writing the test. 

For a test of inference ability in reading comprehension to be valid, the test should require that 
students make a complete inference when they select the best answer for an item (Phillips, 1986). One 
assumption in multiple-choice test construction is that when students select the best answer for a test 
item, they do so for the right reason. However, it is possible that students might select the best answer 
for a test item without fully understanding it. For example, there may be some inadvertent cue 
prompting students to choose the right answer. A second assumption is that students who choose the 
incorrect answer do so because they are not reasoning well, yet students might select an incorrect 
answer for a good reason. For example, there may be an alternate interpretation from that intended by 
the authors, leading students to choose a less than complete answer even though they reasoned well. 
Thus, it is important to have students explain their reasoning when they select their answer for each 
question. 

Students' thinking ability was examined by having them report verbally why they had selected their 
answers. Answers were almost always chosen for reasons. The aim was to study those reasons for 
evidence that good thinking leads to good performance on TIA and poor reasoning leads to poor 
performance. To do this, there must be a description of the reasoning processes which lead to 
performance and a way to rate the quality of students' thinking. A study of students' thinking processes 
was done in an attempt to gain more information on the reasoning underlying their answers than is 
possible by merely considering the selected responses. Information about the nature of the reasoning 
process is pertinent to construct validation (Embretson, 1983). Evidence for construct validity of an 
ability, in this case inference ability, is obtained to the extent that good performance can be explained 
by students' reasoning well and to the extent that poor performance can be explained by not reasoning 
well. Thus, the aim of construct validation is to identify the causes of performance on tests. The aim 
with TIA was to develop a test such that good inference-making was the cause of good test 
performance, and the construct validation through the use of verbal reports was conducted to determine 
the extent to which this has been achieved. These verbal report protocols were used in conjunction 
with the students' answer selections to provide information for test validation. The general principle 
followed was that tests would be valid to the extent that good inference-making led to good 
performance and poor inference-making led to poor performance. 

Sample and Procedure 

One hundred and eighty-three students in Grades 6, 7, and 8 at three schools participated in this pilot. 
The students were selected at random from intact classes and assigned randomly to either of two test 
conditions. There were 95 students tested in the verbal report condition and 88 students in the written 
test condition. 

Students in the written test condition completed the multiple-choice test in their classrooms in a 
manner similar to any group test. The same administration procedures described in Pilot Study 2 were 
followed. Students in the verbal report condition were interviewed by two interviewers using the same 
procedure described in Pilot Study 3. Students in the verbal report cohort were assigned to 1 of the 3 
stories. That is, the first student was assigned story 1, the second assigned story 2, 
the third did story 3, and the fourth student did story 1, thus starting a repeat of the cycle. The total 
administrations per story were as follows: 34 students completed the 'UFOs' story, 32 students 
completed the 'Money' story, and 29 students completed 'The Wrong Newspapers' story. 
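
The rotation described above can be sketched in a few lines of Python. This is an illustration only: the names STORIES and assign_stories are hypothetical, and a strict rotation of 95 students gives 32, 32, and 31 students per story, which differs slightly from the reported counts of 34, 32, and 29.

```python
from collections import Counter

# Story labels taken from the report; the list name is hypothetical.
STORIES = ["UFOs", "Money", "The Wrong Newspapers"]


def assign_stories(n_students, stories=STORIES):
    """Assign students to stories in a repeating cycle: 1, 2, 3, 1, 2, 3, ..."""
    return [stories[i % len(stories)] for i in range(n_students)]


# 95 students took part in the verbal report condition.
assignments = assign_stories(95)
counts = Counter(assignments)

for story in STORIES:
    print(story, counts[story])
```

The fourth student returns to the first story, which is the "repeat of the cycle" the procedure describes.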

Coding 

Three sets of data were collected: reading scores from the written cohort; and reading and thinking 
scores from the verbal report cohort. Written cohort selected responses were scored according to 
criteria developed in Pilot Study 3. (See Criteria for grading TIA in Appendix A.) The reading score 
for an answer ranged from 0 (implausible) to 3 (complete). Students' total reading scores were the 
sum of the values assigned to all answers selected by students. The total possible score is 108. 
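
As a minimal sketch of the total reading score described above, the following Python fragment sums scaled item scores (0 to 3) over 36 items. The function name total_reading_score and the example response pattern are hypothetical; only the 0 to 3 scale, the 36 items, and the maximum of 108 come from the report.

```python
MAX_ITEM_SCORE = 3   # a complete inference
N_ITEMS = 36         # three stories with 12 items each


def total_reading_score(item_scores):
    """Sum the scaled scores (0-3) a student earned across all 36 items."""
    if len(item_scores) != N_ITEMS:
        raise ValueError("expected one score per test item")
    if any(score not in (0, 1, 2, 3) for score in item_scores):
        raise ValueError("item scores must be 0, 1, 2, or 3")
    return sum(item_scores)


# Highest possible total: 36 items x 3 points.
max_total = N_ITEMS * MAX_ITEM_SCORE

# Hypothetical student: 20 complete, 10 partial, 4 text-based, 2 implausible.
example_scores = [3] * 20 + [2] * 10 + [1] * 4 + [0] * 2
example_total = total_reading_score(example_scores)

print(max_total, example_total)
```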

Verbal report explanations of students in the verbal report cohort were assigned thinking scores. The 
thinking score is a measure of the net supporting evidence for an item available from the verbal 
reports. The quality of students' explanations for each answer was rated according to specific criteria. 
(See Appendix B for a copy of the Thinking Rating Scale.) Thus, for each item there was a reading 
score for the answer selected and a corresponding thinking score for a student's explanation of why that 
answer was chosen. 

A trial sample of thinking protocols was selected at random from the three stories and grades. Two 
raters independently assigned a thinking score to each answer justification. Any inconsistencies 
between raters' scoring of thinking protocols were studied. The initial rating of this small sample of the 
verbal report protocols allowed changes in the category descriptions of the thinking rating scale before 
all the protocols were scored. For instance, it was observed that sometimes students simply repeated 
the answer they had selected as their explanation. In the initial thinking rating scale there was no 
provision for such a response. Consequently, a change was necessary and a thinking score of (0) was 
assigned for such responses. 

Total and item thinking scores assigned by the two raters were compared. Inter-rater 
reliabilities on both comparisons resulted in correlation coefficients greater than .90. Any explanations 
assigned different thinking scores by the raters were discussed and re-rated. With a high level of 
reliability on the rating of students' explanations established, it was concluded that the remaining 
protocols could be consistently scored. About 25% of the remaining protocols were checked at 
random and found to have a similarly high level of inter-rater reliability (r > .91). 

Data Analysis 

The data analysis examined six questions: (a) To what extent were students' reading scores and 
thinking scores on each test item in the verbal report cohort correlated? That is, did students who 
reasoned well select the best answer and did students who reasoned poorly select an incorrect answer? 
(b) To what extent were students' total reading and thinking scores for each story in the verbal report 
cohort correlated? (c) How did students' reading scores in the verbal report cohort compare with 
reading scores in the written cohort? (d) How is performance on each item related to overall test 
performance? (e) Did students' reading scores vary by grade level? and (f) Were there interviewer 
effects on test performance? 

Results and Discussion 



Reading and thinking relationships for items. Table 3 presents Pearson's correlation coefficients 
between students' reading and thinking scores by test item. A positive correlation significant at less 
than the .05 level between reading and thinking scores was found for 34 of the 36 items with an average 
correlation of .55. Reading and thinking scores for item 10 were not significantly correlated and for 
item 17 they were negatively correlated. These two items were examined, but no problems were 
apparent. The results of the previous pilot studies were examined and no indications of problems with 
items 10 and 17 were found. The final decision was to leave the items without changes and to examine 
them in the next trial. 



[Insert Table 3 about here.] 

For 94% of the items good thinking was significantly correlated with good reading and poor thinking to 
poor reading performance. This result provides strong evidence that generally when students thought 
well they selected the best answer and when students reasoned poorly they selected an alternate 
answer. The significant correlations between reading and thinking scores for items are one piece of 
evidence that TIA is a valid test of inference ability. 
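
The item-level analysis above rests on Pearson's correlation between the reading score and the thinking score for the same item. A minimal sketch of that computation follows. The eight score pairs are invented for illustration, the 0 to 3 range used here for thinking scores is an assumption (the actual scale is given in Appendix B), and no significance test is attempted.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical scores of eight students on a single item.
reading = [3, 3, 2, 1, 0, 3, 2, 1]
thinking = [3, 2, 2, 1, 0, 3, 1, 1]
r_item = pearson_r(reading, thinking)

# Share of items with a significant positive correlation, as reported: 34 of 36.
percent_significant = round(100 * 34 / 36)

print(round(r_item, 2), percent_significant)
```

The last lines reproduce the 94% figure from the reported count of 34 significant items out of 36.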

Reading and thinking relationships for stories. The reading and thinking relationship for each item is 
by necessity related to this relationship for each story. Twelve items accompany each story; therefore, 
items 1-12 accompany story 1 'UFOs,' items 13-24 accompany story 2 'Money,' and items 25-36 
accompany story 3 'The Wrong Newspapers.' Table 4 presents Pearson's correlation coefficients 
between total reading and thinking score for the three stories. The correlation coefficients were 
similar, high, and significant at the .001 level for the three stories. 



[Insert Table 4 about here.] 

It is reasonable to conclude that students understood the items and that students who selected the best 
answers thought well. Thus, the significant reading and thinking relationships for stories are taken to be 
another piece of evidence that TIA is a valid test of inference ability. 

Reading performance relationships between verbal report and written cohorts. Table 5 presents story 
reading score means by cohort. The maximum reading score for a story would be 36, as each story has 
12 test items, with a total possible score of (3) per item. Means for story 1 and story 2 were very 
similar for the verbal report and written cohorts. Means for story 3 differed by 2.8 in favor of the 
verbal report cohort. While not significant, it is not clear why a difference occurred. This difference 
translated into test performance would amount to the verbal report cohort doing better on one item. 
Across the entire test, the overall mean for the verbal report cohort is 24.7 and 23.3 for the written 
cohort. A difference between means of 1.4 translates into less than half an item correct in favor of the 
verbal report cohort. Thus, the difference in means on story 3 was not taken to be large enough to 
invalidate the verbal report methodology. Asking students to think aloud does not significantly alter 
their performance. 



[Insert Table 5 about here.] 

There were no significant effects for cohort for either 
the UFOs or Money stories. However, cohort showed a significant effect, F (1, 111) = 6.06, p < .05, for 
The Wrong Newspapers story. It is not easy to explain why a difference in performance by cohort was 
found on only The Wrong Newspapers story. 

Grade had a significant effect on students' reading scores for story 1 (UFOs), F (2, 116) = 7.58, p < 
.001, but was not significant for stories 2 or 3. The discourse type may account for the grade effect 
found for story 1. Students in Grades 6, 7, and 8 may all be familiar with the descriptive and narrative 
discourse forms of stories 2 and 3. But students' reading scores on expository material (story 1) might 
show an improvement for students in Grades 7 and 8 when compared to Grade 6 students. There was 
no significant interaction effect between grade and cohort. 

In sum, reading performance between the verbal report and written cohorts was taken to be highly 
related. Assuming that verbal reports are an accurate representation of the thinking that went on 
during the test-taking and the reports are an accurate representation of the thinking of those in the 
written cohort, then it can be concluded from the evidence presented that students understood the task 
and reasoned well when they picked the best answer. In addition, the usefulness of verbal reports in 
understanding students' reasoning and validating less direct measures of inference is strongly 
supported. 

The relationship of item performance to story performance. Students in the verbal report cohort 
completed only one story, so item analysis results are presented by story for both the verbal report and 
written cohorts. 

The item/test biserial correlation coefficients were positive for all test items and ranged from a low of 
.1« (item 17) to a high of .693 (item 3). The correlation coefficients show that generally students' 
performance on individual test items was positively related to overall test performance. 

The difficulty indices were computed as the proportion of students picking the best answer. This is not 
the best indicator of difficulty for scaled-answer items because it does not take account of partially 
correct scores (1 or 2). In a subsequent section on final data collection, an index computed as average 
score on an item is used, but the rough index based on rights and wrongs will suffice here because only 
the best answer was to be considered. A low difficulty index (.100) would indicate a more difficult test 
item than a test item with a high difficulty index (.600). The difficulty indexes of the test items ranged 
from a low of .197 (item 1) to a high of .746 (item 12). The range in the difficulty level of items was 
expected and is within normally recommended bounds. 
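
The two difficulty indices contrasted above can be sketched as follows. The function names and the ten example scores are hypothetical; the first index counts only the best answer (score 3), while the second averages the scaled scores and therefore credits partially correct answers. Whether the later index is rescaled in any way is not stated here, so the plain average is an assumption.

```python
def proportion_best_answer(item_scores, best_score=3):
    """Rights-and-wrongs index: proportion of students who picked the best answer."""
    return sum(1 for score in item_scores if score == best_score) / len(item_scores)


def mean_item_score(item_scores):
    """Scaled index: average score (0-3) on the item, crediting partial answers."""
    return sum(item_scores) / len(item_scores)


# Hypothetical scaled scores of ten students on one item.
item_scores = [3, 3, 2, 2, 1, 0, 3, 2, 1, 3]

rough_index = proportion_best_answer(item_scores)
scaled_index = mean_item_score(item_scores)

print(rough_index, scaled_index)
```

In this example only 4 of 10 students chose the best answer, yet the average score is 2.0 out of 3, which shows how the rough index can make an item look harder than the scaled scores suggest.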

The KR-20 reliabilities calculated separately for each of the three test stories (12 items each) and only 
for the correct answer for the combined verbal report and written test cohorts were 0.57, 0.23, and 0.50 
for stories 1, 2, and 3. The written test cohort completed all 36 test items with a KR-20 reliability of 

In sum, the results of the relationship of item performance to overall story performance were taken as 
evidence that students understood the task and that students reasoned well when they chose the best 
answer. Thus, it is concluded that TIA requires students to make a complete inference when they 
select the best answer. 
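
The KR-20 reliabilities reported in this section were calculated "only for the correct answer," that is, on items scored right or wrong. A minimal sketch of the standard Kuder-Richardson formula 20 under that scoring is given below. The helper names and the small response matrix are hypothetical, and the use of the population variance of total scores is an assumption, since the report does not say which variance convention was used.

```python
from statistics import pvariance


def dichotomize(scaled_scores, best_score=3):
    """Convert scaled scores (0-3) to 1 (best answer) or 0 (any other answer)."""
    return [1 if score == best_score else 0 for score in scaled_scores]


def kr20(responses):
    """KR-20 for a list of students, each a list of 0/1 item scores."""
    n_students = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    total_variance = pvariance(totals)  # population variance (assumed convention)
    if total_variance == 0:
        raise ValueError("KR-20 is undefined when all total scores are equal")
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_students  # proportion correct
        sum_pq += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - sum_pq / total_variance)


# Hypothetical scaled scores: five students, four items.
scaled = [
    [3, 3, 3, 3],
    [3, 3, 3, 2],
    [3, 3, 1, 0],
    [3, 2, 2, 1],
    [1, 0, 2, 2],
]
binary = [dichotomize(row) for row in scaled]
reliability = kr20(binary)

print(round(reliability, 2))
```

With only 12 items per story, as in this pilot, KR-20 values are expected to be lower than for a full-length test, since the coefficient depends on the number of items.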

Reading scores by grade. Students' reading scores by grade level were also examined. Since only the 
written test cohort in the fifth pilot study completed all 36 test items, only their scores were used 
in this part of the analysis. As the highest reading score for each test item was (3), the maximum 
reading score for 36 items was 108. Reading score means and standard deviations are presented in 
Table 6 for Grades 6, 7, and 8 students. 

[Insert Table 6 about here.] 

A one-way ANOVA was performed with reading score as the dependent variable and grade as the 
independent variable. Grade was found to have a significant effect, F(2, 87) = 4.98, p < .01. The 
overall trend in mean performance from Grades 6 to 8 was a desirable result. It is assumed that 
students would perform better in making inferences with each passing grade. Since TIA is intended as 
a measure of inference ability in Grades 6, 7, and 8, a significant difference by grade suggests that 
TIA is sufficiently discriminating to detect differences in performance, should they exist, at each grade 
level. 
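
The one-way ANOVA described above can be illustrated with a minimal sketch. It is not the analysis code used in the study: the function name `one_way_anova` and the three small score lists are hypothetical, chosen only to show how the F ratio and its degrees of freedom are obtained from group scores.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of scores."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    group_means = [mean(g) for g in groups]
    # Between-groups sum of squares: group size times the squared distance
    # of each group mean from the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within-groups sum of squares: squared distance of each score from
    # the mean of its own group.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# Hypothetical reading scores for three grades (not the study's data).
grade6 = [60, 62, 65, 70]
grade7 = [66, 70, 72, 74]
grade8 = [70, 75, 78, 81]
f_ratio, df_between, df_within = one_way_anova([grade6, grade7, grade8])
print(f"F({df_between}, {df_within}) = {f_ratio:.2f}")
```

The resulting F ratio would then be compared with a tabled critical value for the same degrees of freedom to decide whether grade has a significant effect.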

Interviewer effect on test performance. Verbal reports of students' reading and thinking scores were 
analyzed by story with interviewer as the independent variable. Two separate one-way ANOVAs were 
performed for each of the three stories. Therefore, for each story there was one analysis with reading 
score as the dependent variable and a second analysis with thinking score as the dependent variable. 
No significant effect for interviewer was found at the .05 level for any of the six ANOVAs calculated. 
Interviewer, therefore, did not seem to affect students' reading or thinking performance in the verbal 
report cohort. This was an encouraging result which provided support for the usefulness of trial 
interviews to eliminate potential interview problems which may affect the primary data collection. 

Section Summary 

The data analysis and test results discussed in the preceding sections show the development and 
statistical support for TIA as a test with both construct validity and reliability. Each subsequent version 
of the TIA test was an improvement over each previous version and it was not clear what would be 
gained from further data collection, so the pilot studies were considered to be complete and the TIA 
test ready for final data collection. 

Final Data Collection: Analysis and Results 

This section reports the demographics of the samples, the final data collection procedures, and the 
basic statistics. It also discusses potential extraneous influences on test performance and presents the 
reliability estimates of TIA. 

Samples and Data Collection 

Nine hundred and ninety-nine students in Grades 6, 7, and 8 from schools in Alberta, Newfoundland 
and Labrador, Nova Scotia, and Ontario comprised the samples for the final data collection. Contact 
was made with educators at schools and school board offices and their cooperation was sought for the 
administration of the TIA test. 

When approval was granted to proceed with the project, the contact persons were forwarded the 
necessary materials. They either arranged to give the tests themselves or for classroom teachers to 
administer them during scheduled language arts classes. Each participating teacher was given a copy of 
the directions (see footnote 3). The original contact person was the facilitator for each province. That 
person took responsibility for distributing the materials, ensuring their proper administration, 
collecting the materials, and returning them to the Institute for Educational Research and 
Development at Memorial University of Newfoundland. The final data collection took place during the 
winter and spring of 1987. 

Students in the Alberta sample were from an urban centre with a population of approximately 60,000. 
It is a trading centre for an agricultural-based economy. The students were described by their teachers 
as mostly middle class. The schools range in population from 150^. Less than 4% of the school 
population is English as a second language (ESL) or native students. Classes were described as having 
a few bright students, a majority of average students, and some students requiring additional assistance 
with instruction through remediation classes or learning assistance programs. 

The Newfoundland and Labrador sample of students were from two large rural centres. In one centre 
the population, including surrounding villages, is approximately 10,000. The area may be described as 
economically depressed with the majority of families described as low to middle class. The students 
were from a school with a total population of 430 students. None of the students were ESL students; 
however, some had been involved in French immersion programs. Classes were described as 
heterogeneous. The other rural centre has a combined population of approximately 14,000 people in 
two adjacent towns. It is a mining centre with a high employment rate and, for the most part, middle 
class families. The students are from schools ranging in size from 350^ students. There are no ESL 
students, but French immersion programs are common. The classes were described as heterogeneous. 

Students in the Nova Scotia sample were from two rural areas ranging in population from 2,>100-5,000 
people. The economy is farm-based, with the communities comprising a mixture of lower and middle 
class families. The populations of the two schools were 200 and 275 with no ESL students. Classes 
were described as heterogeneous. 

Students in the Ontario sample were from an urban centre with a population ranging from 60,000 to 
110,000 including the surrounding areas from which children are bussed to the city schools. The 
students were from a wide range of economic levels, and many were from single-parent homes. Classes were 
described as heterogeneous with about 20% of the students requiring remedial instruction. Less than 
3% of the school population includes ESL students, and brighter students often go into French 
immersion programs after Grade 4. The economy in the area is built on service and government 
institutions. 

Analysis and Results 

Students recorded their answers to the TIA questions on a standardized answer sheet. Each question 
has four possible answers (A, B, C, or D). Each answer is worth a value of 0, 1, 2, or 3 
depending upon the quality of the selected response. The appropriate value was assigned to each 
selected response and the assigned values were totalled to constitute the test score for each student. 

Table 7 presents mean performance scores for the entire sample by sex, grade, and age. The 
mean for the entire sample of students for whom data were complete (N = 974) is 73i7 with a standard 
deviation of 13.63. Table 8 shows the ANOVA main effects on these same variables. Of particular 
relevance in this report are the main effects of sex, grade, and age on test score. 

[Insert Tables 7 & 8 about here.] 

In the case of significant sex differences, the females' mean performance is higher than the males'. A 
comparison of the means presented in Table 7 indicates a difference of about two points. Based on the 
perplexing welter of research findings on differences between males and females, it seems many 
questions remain unanswered (Downing, May, & Ollila, 1982). Questions about such matters as the 
effects of different cultural expectations, genetic factors, and teacher-model differences all seem to 
point to the necessity of further research prior to the drawing of any conclusions. Differences in these 
data on the basis of sex are minimal. Furthermore, differences in performance between the sexes are 
small in comparison to the range of differences in performance within a sex. For these reasons it is not 
believed that the TIA test is biased in favor of either sex, nor that the data are untrustworthy for 
generalization purposes, regardless of sex. 

Table 8 shows grade and age significant at the p < .05 level. In the case of significant differences by grade 
it is important to know whether the differences are between Grades 6 and 7, Grades 7 and 8, or 
Grades 6 and 8. Scheffé's a posteriori comparison of means test was done (Kirk, 1968). While Scheffé's S 
method allows for the calculation of significant differences in means when there are unequal n's, it does 
set the highest critical statistic of all the multiple comparison tests. The only critical difference in 
means on the TIA test was between Grades 6 and 8, where the difference in means (5.70) exceeds the 
critical value of 3^. It is likely that, had it been appropriate to use a less rigorous test, differences 
between Grades 6 and 7 and between Grades 7 and 8 would have been found. Nevertheless, the significant 
difference between Grade 6 and Grade 8 indicates a general tendency for performance to improve with 
grade level. 
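
For readers unfamiliar with Scheffé's S method, the following is a minimal sketch of how a critical difference between two group means is commonly computed. It is an illustration under stated assumptions, not a reconstruction of the calculation reported above: the function name `scheffe_critical_difference` and every number passed to it are hypothetical, and the critical F value must be looked up in a table for the chosen significance level and degrees of freedom.

```python
import math


def scheffe_critical_difference(ms_within, n_i, n_j, k, f_crit):
    """Smallest absolute difference between two group means that
    Scheffe's S method declares significant.

    ms_within : within-groups mean square from the ANOVA
    n_i, n_j  : sizes of the two groups being compared (may be unequal)
    k         : number of groups in the ANOVA
    f_crit    : tabled critical F for (k - 1, df_within) at the chosen alpha
    """
    return math.sqrt((k - 1) * f_crit * ms_within * (1 / n_i + 1 / n_j))


# Hypothetical inputs: three groups of 50 students each, a within-groups
# mean square of 100, and an assumed tabled critical F of 3.06.
critical_difference = scheffe_critical_difference(
    ms_within=100.0, n_i=50, n_j=50, k=3, f_crit=3.06)
print(round(critical_difference, 2))
```

Because the factor (k - 1) multiplies the critical F, the method is conservative for pairwise comparisons, which is the sense in which it sets the highest critical statistic among the multiple comparison tests.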

The grade by age interaction is a result of the fact that scores increase by grade only for students with 
grade-appropriate ages. Students who are old for their grade tended to do more poorly than students 
with grade-appropriate ages. 

Item Statistics 

Item Difficulty 

Typically, difficulty level indices reported on a multiple-choice test are given as the proportion of 
students getting an item right. On a standard multiple-choice test the only scores are 1 and 0, where 
1 is for the correct choice and 0 is for any other choice. Thus, the proportion of people getting the item 
right, that is the difficulty level, is just the average performance on the item. So by extension, the 
difficulty level index for the TIA test, in which possible scores are 0, 1, 2, or 3, is again the average 
performance on the item. See Table 9 for the percentage breakdown of students who chose an answer 
worth 0, 1, 2, or 3 for each test item (total of 36 items). The percentage of students who chose answers 
other than the best (the most consistent and complete) for each item reflects the variability in 
performance. Item difficulties are given in Table 10. When reading this table, note that the higher the 
difficulty level index the easier the item. As can be seen from Table 10, there is a range of difficulty 
levels across the test. For instance, many students found item 12 fairly easy whereas item 4 appears to 
have been more difficult for them. These indices represent a range of challenge for students. 
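
The extension of the difficulty index from right/wrong items to scaled items can be shown in a short sketch. The function names are hypothetical and the code is illustrative only. The percentage breakdown used in the second example is the Grade 6 row for item 12 as read from Table 9 (5, 8, 8, and 79 percent for scores of 0, 1, 2, and 3).

```python
def difficulty_index(item_scores):
    """Average score on one item across students.

    For 0/1 scoring this is the proportion answering correctly; for
    scaled 0-3 scoring it is the mean item score, so it ranges from 0 to 3.
    """
    return sum(item_scores) / len(item_scores)


def difficulty_from_percentages(pct_by_score):
    """Same index, computed from the percentage of students at each score."""
    total = sum(pct_by_score.values())
    return sum(score * pct for score, pct in pct_by_score.items()) / total


# Standard right/wrong item: three of four students correct.
dichotomous = difficulty_index([1, 0, 1, 1])

# Scaled item: percentages of students choosing answers worth 0, 1, 2, 3.
scaled = difficulty_from_percentages({0: 5, 1: 8, 2: 8, 3: 79})
print(dichotomous, round(scaled, 2))
```

On the scaled index a value near 3 indicates an easy item and a value near 0 a hard one, which is why higher values in Table 10 correspond to easier items.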

The Relationship of Item Performance to Overall Performance 

Typically, the item/test correlation is computed using a biserial correlation coefficient. This is a 
correlation between dichotomous (0,1) and continuous variables (0-36 range of items). However, on 
the TIA test item scores are not merely dichotomous variables, but rather interval variables (0, 1, 2, 
and 3), so the appropriate statistic is Pearson's r. Table 10 presents the item/test correlations. 
There was one negative item/test biserial correlation (item 14) and one which was essentially zero 
(item 17). Item 14, as can be seen from Table 9, was answered by the greater proportion of students as 
a non-inference question (a score of 1). In other words, students chose a text-based response. Such a 
response by the majority of students points to a problem with either the wording of the question or a 
perceived high similarity among answer choices on the part of the students. A reexamination of 
students' verbal reports from the last pilot study indicated a problem with word choice in item 14. A 
revision has been made for future uses of TIA. Item 17, also on the Money story, demanded a high 
level of understanding of the features of money which may have been unfair to the students. A 
reexamination of students' verbal reports corroborated the suspicion that in order to choose a 2-point 
answer, they made a sophisticated, complete, and consistent inference. The scoring key was modified 
for item 17. 
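
A minimal sketch of the item/test correlation for scaled items follows. It is illustrative only: the function names and the small response matrix are hypothetical, and each item is left in the total score it is correlated with, which is an assumption about the procedure rather than something stated in this report.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def item_test_correlations(responses):
    """Correlate each item's 0-3 scores with students' total test scores.

    responses: one list per student, holding that student's item scores.
    """
    totals = [sum(student) for student in responses]
    n_items = len(responses[0])
    return [pearson_r([student[i] for student in responses], totals)
            for i in range(n_items)]


# Hypothetical scores of five students on three items.
responses = [
    [3, 2, 0],
    [2, 2, 1],
    [1, 1, 3],
    [3, 3, 1],
    [0, 1, 2],
]
correlations = item_test_correlations(responses)
print([round(r, 2) for r in correlations])
```

In this made-up example the third item correlates negatively with the total, the kind of pattern that would flag an item for reexamination.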

[Insert Tables 9 & 10 about here.] 

The Relationship of Story Performance to Overall Performance 

TIA is made up of three of the most common discourse forms found in the middle grades. Research 
indicates that narrative, descriptive, and expository texts make distinct demands upon readers; thus it 
seems that differences in comprehensibility between narrative and descriptive and expository texts 
should be expected. Table 11 shows the percentage of all responses receiving scores of 0, 1, 2, 3 by 
grade level and story. Seventy-two percent of all responses on The Wrong Newspapers are quality 
responses (scores of 2 and 3) compared to 66% on the UFOs story and 68% on the Money story. The 
Wrong Newspapers story is a narrative, the discourse form taken to be the easiest of the three, yet the 
differences in performance are not as dramatic as expected. This result raises an interesting question. 

[Insert Table 11 about here.] 

A question which remains to be studied is whether particular types of inference questions, regardless of 
discourse form, present more of a challenge to students than others. There is circumstantial evidence 
from the pilot studies and from questions rated as difficult on the final study to support such a 
suspicion. Logical inference questions on TIA that required the synthesis of several pieces of 
information were more likely to be answered in an incomplete manner than informational inference 
questions such as elaboration or setting the context, regardless of the discourse form. A plausible 
explanation for the minimal performance differences displayed in Table 11 is that inference 
questions requiring the synthesis of several pieces of information were asked on all three discourse 
forms. If students experience difficulty with logical inference questions as suggested, then perhaps the 
type of question asked is an important variable in addition to the discourse form being studied. 

Potential Extraneous Influences 

Potential extraneous influences on performance on the TIA test include such factors as test-taking 
strategies, test-wiseness, and guessing. While these are not mutually exclusive, I will deal with each 
separately. 

Test-Taking Strategies 

Care was taken to provide clear, unambiguous directions to all TIA test-takers. Students were 
informed that they were to use information provided in the story and information they already knew in 
deciding upon the best answer. They were told to think about which answer out of four they thought 
was the best one. A sample item was done with the students. (For a copy of the complete instructions 
see footnote 3.) Special mention was made of the importance for students to consider all possible 
answers before deciding on the best one. Students were informed of the scoring system. 

TIA is a power test, so no strict time limits were set. Students were told that it takes about a class 
period or so to do the test. Teacher reports of the final data collection indicated that most students 
finished the test in about 30-35 minutes, excluding time for directions (total time approximately 40-45 
minutes). The intent was to allow students time to think and to carefully consider their answer choices 
without the pressure of a speed component. Test users may use the average completion time of 30 
minutes as an indication of how their classes compare with others in time taken to do the test. 

During the development of TIA, attention was paid to a host of simple but important rules for test 
construction which are in harmony with sound established measurement principles (Standards for 
Educational and Psychological Testing, 1985). These include rules such as avoiding items with negative 
questions, using qualifiers such as "always" and "usually" cautiously, and avoiding item stems similar to text 
information. Other factors which may have contributed to test-taking strategies were considered in the 
test refinement process and have been reported in a previous section. 

Test-Wiseness 

There is a sense in which test-wiseness has to do with general wiseness or perceptiveness. Students 
may capitalize upon cues of various sorts which would result in improved performance on a test for 
reasons other than use of the ability being tested. Students' verbal report protocols were studied in 
each of the pilot studies for evidence of use of such cues. Test revisions were made if cues were 
suspected. Test revisions were discussed in a previous section. 

Guessing 

In the case of a short-answer test students guess only if they construct a response, which reduces the 
risk of attaining a higher score due to guessing. On the other hand, in the case of a multiple-choice 
test, the probability of attaining a higher score due to guessing is greater (Slakter, 1967). 

The scaled-answer scoring system (scores of 0, 1, 2, 3) used on TIA further complicates the issue of 
guessing, because the probability of a student getting some positive score for each item on TIA is .75. 
Contrast this with the standard four-option multiple-choice test with one correct answer, where the 
probability of getting a positive score is only .25 on each item. Given the unusual scoring system used 
on TIA, it would be instructive to examine the distribution of scores which could be obtained through 
guessing (Johnson & Kotz, 1977; Larsen, 1974). 

Table 12 presents the distribution of total possible test scores (0-108) and their corresponding 
cumulative probability under the assumption of random guessing on each item. Note that since there is 
a considerable chance of scoring points through guessing it is virtually impossible to guess and receive 
less than about 40. Even for a total score of 54, which would be 50%, there is a 47% chance of attaining 
a score at least this high by guessing. However, if you look at a total score of 58, only 4 points higher, the 
chances of attaining a score of 58 or higher are dramatically reduced to 25%. 
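
The distribution of total scores under random guessing can be computed exactly by convolving the per-item score distribution 36 times. The sketch below is an illustration, not the procedure used to produce Table 12. It assumes that each item has exactly one option at each of the values 0, 1, 2, and 3 and that a guessing student picks each option with probability 1/4; the function names are hypothetical.

```python
from fractions import Fraction


def guessing_distribution(n_items=36, option_scores=(0, 1, 2, 3)):
    """Exact P(total = t) when each option is equally likely on every item."""
    p_option = Fraction(1, len(option_scores))
    dist = {0: Fraction(1)}
    for _ in range(n_items):
        updated = {}
        # Add one more guessed item to every total reachable so far.
        for total, prob in dist.items():
            for score in option_scores:
                key = total + score
                updated[key] = updated.get(key, Fraction(0)) + prob * p_option
        dist = updated
    return dist


def prob_at_least(dist, score):
    """Probability of a total score of `score` or higher by guessing."""
    return float(sum(p for total, p in dist.items() if total >= score))


dist = guessing_distribution()
expected_total = float(sum(total * p for total, p in dist.items()))
print(expected_total, prob_at_least(dist, 60), prob_at_least(dist, 74))
```

Under these assumptions the expected total from guessing is 36 items times 1.5 points, or 54, and the distribution is symmetric about that value, so the chance of a guessed total drops quickly as the score moves above the mid-50s.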

[Insert Table 12 about here.] 

The overall mean score on TIA is 73.48 and as can be seen from Table 12 there is virtually no chance 
of a student getting this score or higher from guessing. Indeed the chances of getting any score of 60 or 
higher through guessing are quite low. The pattern of probability distributions displayed in Table 12 
indicates that while scores up to about 60 can be expected to reveal very little about inference ability 
because of the guessing factor, scores above this point are virtually unattainable through guessing. 

You will recall from the discussion of Pilot Study 5 in a previous section that there were highly 
significant correlations between reading scores and thinking scores (see Table 3) and that there were 
no significant differences in performance between the verbal report and the written response cohorts. 
These two points are worth mentioning here in terms of rounding out this discussion on guessing. The 
first point provides evidence that generally when students thought well they selected the best answer 
and that students who reasoned poorly selected an alternate answer. An examination of the verbal 
report protocols showed that despite the opportunity of having a best-answer option, students generally 
chose the answer that made most sense to them, the one that they could justify. It would seem, then, 
that there was much more going on than guessing. 

The second point, that there were no significant differences in performance between the verbal report 
and written cohorts, may point to a uniqueness in the nature of the task. Recall that Pilot Study 2 
showed minimal differences in the amount of time taken by students to complete the multiple-choice 
and short-answer formats, suggesting that the reasoning demands of the task were similar. TIA is a 
test of inference ability, the ability to integrate relevant textual information and background knowledge, 
and requires reasoning regardless of response format. In other words, determining the best answer on 
TIA requires making an inference regardless of the format of the test. This argument is very similar to 
one found in the area of mathematics where it has been argued that tests based on the same content 
but different formats require equivalent reasoning in test performance (Traub & Fisher, 1977). 

Kuder-Richardson 20 Reliability Estimates 

Table 13 gives means, standard deviations, and KR-20 reliabilities for each story and for the total test. 
The Kuder-Richardson 20 reliability estimates are conservative; they give a lower bound estimate of 
reliability on a test. Nevertheless, it would be fair to say that TIA's reliability of .79 is highly 
satisfactory given the number of test items (36) and the reported reliabilities of similar tests requiring 
students to reason well such as the following: Test on Appraising Observations (50 items) .69; Cornell 
Critical Thinking Test, Level X (71 items) .85; and the Watson-Glaser Critical Thinking Appraisal, 
Form A (80 items) .80. 
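
For reference, the Kuder-Richardson formula 20 for right/wrong items can be sketched as follows. This is an illustration only: the function name `kr20` and the six-student response matrix are hypothetical, and the sketch covers the dichotomous (0/1) case, which is an assumption; how the scaled 0-3 scores entered the reliabilities reported in Table 13 is not described in this section.

```python
def kr20(responses):
    """Kuder-Richardson formula 20 for dichotomously scored items.

    responses: one list per student, each holding 0/1 item scores.
    """
    n_students = len(responses)
    n_items = len(responses[0])
    totals = [sum(student) for student in responses]
    mean_total = sum(totals) / n_students
    # Population variance of the total scores.
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_students
    # Sum over items of p * q, where p is the proportion answering correctly.
    sum_pq = 0.0
    for i in range(n_items):
        p = sum(student[i] for student in responses) / n_students
        sum_pq += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - sum_pq / var_total)


# Hypothetical right/wrong scores of six students on four items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
reliability = kr20(responses)
print(round(reliability, 3))
```

The estimate rises as the variance of total scores grows relative to the summed item variances, which is one reason longer tests tend to show higher reliabilities.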

[Insert Table 13 about here.] 

Summary of Present Efforts and Future Prospects 

This report has described the design and development of a test of inference ability in reading 
comprehension. It is a scaled-answer multiple-choice test intended for use with students in Grades 6, 
7, and 8. 

The Test of Inference Ability in Reading Comprehension is based upon what is currently known about 
inference and is in accord with the best available principles and information as described in a previous 
section. The principles of inference appraisal and the work reported here are not meant in any sense to 
be definitive, but rather are meant to be a chart in what is an uncharted testing area. It is to be seen as 
an important starting point open to extension. The objectives, design, and evolution of the test 
reported in the preceding sections represent a comprehensive methodology aimed at validly appraising 
inference ability in reading comprehension. 

In order to have construct validity in tests of reading comprehension we must seek out the causes of 
performance on them. Responses on measures of reading comprehension may be correct or incorrect 
for very different reasons. Correct responses are not sufficient evidence of comprehension because 
sometimes they are the result of minimal reasoning. Conversely, incorrect responses are not sufficient 
evidence of a lack of comprehension, because sometimes they are a result of comprehension. Students' 
verbal reports and written explanations as to why they made their choices are ways to seek out causes 
of performance. 

Central to this work are future prospects. The completion of a manual which will allow for diagnostic 
information for instructional decision-making purposes is the next immediate project. Diagnostic 
information will be reported in a manner that describes students' performance in terms of the quality 
of the inferences they have made, and the variations in inference ability across question types and 
across discourse forms. Such process-oriented information provides the necessary understanding of 
where students need instruction (Frederiksen, 1984), and to that end, specific teaching suggestions will 
be offered. The development of a short-answer form of the test which would allow for a more direct 
evaluation of the effect of background beliefs and levels of sophistication on students' performance is 
also planned. 

Other prospects include studies to identify the kinds of strategies students use in attempting to 
understand the various discourse forms, to measure the effectiveness of those strategies, and to explore 
the claim that reading well is thinking well by studying the relationships among inference strategy use, 
good inference-making in text comprehension, overall reading comprehension, and critical thinking 
performance. 

A future prospect of a more collaborative nature is to study the seemingly multiple perspectives on the 
appraisal of inference ability through an examination of the types of inference demands made by the 
TIA test compared with those on current but more general comprehension assessment projects 
(Valencia & Pearson, 1987; Wixson, Peters, Weber, & Roeber, 1987). 

Footnotes 

¹A copy is available by writing to Linda M. Phillips, Institute for Educational Research 
and Development, Memorial University of Newfoundland, St. John's, Newfoundland, Canada 
A1B 3X8 or by telephoning (709) 737^25. 

²The technical definition of "story" does not strictly apply to all three discourse forms on TIA, 
but then neither does the alternate term "passages," so I have decided to use story in the generic sense. 

³Copies of all three tables and appendixes are presented in the technical report which may be 
obtained by contacting the author. 

Table 1 

Pilot 2, Item Statistics 

         Item/Total      Item Difficulty 
Item     Correlations    Index 

  1         .448            Ml 
  2         .528            .610 
  3         .667            .317 
  4         .378            .293 
  5         .477            .512 
  6         .710            .512 
  7         .389            .683 
  8         .072            .195 
  9         .414            .488 
 10         .327            .463 
 11         .752            .829 
 12         QAA             .805 
 13         .311            .098 
 14         .217            .561 
 15         .217            .561 
 16         .206            .244 
 17         .573            .293 
 18        -.053            .07: 
 19         .340            .366 
 20        -.153            .366 
 21         .453            .488 
 22         .544            .561 
 23         .441            .780 
 24         .203            .171 
 25         .306            .732 
 26         .795            .902 
 27         .412            .512 
 28         .101 
 29         .221            .488 
 30         .271            .585 
 31         .608            .488 
 32         .301            .463 
 33         .456            .585 
 34         .560            .463 
 35        -.316            .098 
 36         .505            .683 

Table 2 

Pilot 2, % Agreement Between Full Credit Short-Answer and Multiple-Choice 

Item     %        Item     % 

  1      51        19      68 
  2      46        20      11 
  3      28        21      40 
  4      32        22      33 
  5      50        23      75 
  6      39        24      54 
  7      16        25      50 
  8      11        26      10 
  9      28        27      31 
 10      59        28       8 
 11      47        29 
 12      79        30       0 
 13      18        31      53 
 14      18        32      49 
 15      59        33      62 
 16      r/        34      63 
 17      40        35      11 
 18      43        36      45 

Table 3 

Pilot 5, Pearson Correlation Coefficients Between Reading and Thinking Scores by 
Item 

Item     Pearson's r        Item     Pearson's r 

  1        .82**             19        .54** 
  2        .62**             20        .45* 
  3        .69**             21        .51** 
  4        .77**             22        .39* 
  5        .62**             23        .45* 
  6        .80**             24        .48* 
  7        .72**             25        .66** 
  8        .54**             26        .45* 
  9        .94**             27        .42* 
 10        .09               28        .53* 
 11        .68**             29        .72** 
 12        .50**             30        .37* 
 13        .56**             31        .61** 
 14        .32**             32        .42* 
 15        .30*              33        .38* 
 16        .52**             34        .83** 
 17       -.12               35        .54** 
 18        .77**             36        .61* 

*p < .05 
**p < .001 
N = 95 

Table 4 

Pilot 5, Pearson Correlation Coefficients between Reading Scores and Thinking 
Scores by Story 

Story                                     Pearson's r 

"UFOs" Story 1                            jj* 
"Money" Story 2                           .75* 
"The Wrong Newspapers" Story 3            jj* 

*p < .001 

Table 5 

Pilot 5, Story Reading Score Means by Cohort 

                            Story 
Cohort 

Verbal Report       22.3    24.8    27.0 
Written             22.2    23.6    24.2 

Table 6 

Pilot 5, Reading Score Means and Standard Deviations by Grade 

Grade           M        SD 

6              65.1     10.8 
7              70.3     12.5 
8              74.6     11.5 
All grades     69.9     12.1 


Table 7 

Mean Scores by Sex, Grade and Age 

Variable          Mean       N 

Sex 
  Male            72J7      518 
  Female          74.46     481 
Grade 
  Grade 6         70.77     324 
  Grade 7         72.99     330 
  Grade 8         76.47     344 
Age 
  11 Years        71.^9     218 
  12 Years        73.29     29? 
  13 Years        75.14     332 
  14 Years        72.85     126 

Table 8 

ANOVA Results on Final Data: Test Score by Sex, Grade and Age 

Source of         Sum of                 Mean                  Significance 
Variation         Squares        DF      Square        F       of F 

Sex                               1      J /O.JO      3.73     .054 
Grade             2498.39         2      1249.20      8.08     .000 
Age               2425.09         3       808.36      5.23     .001 
Sex X Grade       1122.91         2       561.45      3.63     .027 
Sex X Age          827.75         3       275.92      1.78     .148 
Grade X Age       2595.26         4       648.81      4.20     .002 
Within          141319.11       914       154.62 

Table 9 

Percentage of Students Obtaining each Possible Score per Item 

                   Scores                         Scores                         Scores 
Item  Grade    0   1   2   3     Item  Grade    0   1   2   3     Item  Grade    0   1   2   3 

  1   6*      37  17  27  19      13   6       12  16  13  59      25   6        9   7   9  75 
      7**     40  11  24  25           7       13  14   9  64           7       10   8  11  71 
      8***    38  10  27  25           8        8  13  10  69           8        6   6  12  76 

  2   6        6  10  17  67      14   6        5  59  11  25      26   6       23   4  14  59 
      7        5  11  12  72           7        5  69   8  18           7       22   5  15  58 
      8        5   7   9  79           8        4  68   5  23           8       26   5  11  58 

  3   6       12  28  10  50      15   6       47   2   5  46      27   6       26  11  20  43 
      7       13  28   7  52           7       34   6   8  52           7       27  11  17  45 
      8       17  23   7  53           8       29   3   5  63           8       25   8  21  46 

  4   6       29  34  16  21      16   6       10  14  13  63      28   6       16   9  30  45 
      7       26  29  17  28           7        6  10  17  67           7       17   9  28  46 
      8       25  31  22  22           8        7   5  18  70           8       13   9  36  42 

  5   6        4  18  43  35      17   6       12  24  38  26      29   6       11  11  45  33 
      7        3  17  33  47           7        9  30  37  23           7       11   8  40  41 
      8        2  20  30  48           8       10  22  44  24           8       11  11  33  45 

  6   6       19  19  11  51      18   6       23  12  11  54      30   6       27  14  10  49 
      7       13  18   7  62           7       22  10   9  59           7       24  15  10  51 
      8        8  20   5  67           8       18  10   9  63           8       20  11  11  58 

  7   6       23  19   9  49      19   6       16  14  27  43      31   6       21   4  31  44 
      7       15  17   8  60           7       18  14  25  43           7       18   5  31  46 
      8       15  17   9  58           8       17  13  27  43           8       10   4  25  61 

  8   6       15  16  16  53      20   6       15  16  23  46      32   6       35   5  13  47 
      7       18  14  16  52           7       11  20  20  49           7       34   5  13  48 
      8       17  11  16  56           8        8  19  17  56           8       28   2  14  56 

  9   6        4  51   6  39      21   6       15  16  15  54      33   6       16   8  20  56 
      7        1  43   5  51           7       13  12  11  64           7       20   8  21  51 
      8        L  39   4  55           8        7  13  17  63           8       12  12  18  58 

 10   6        6  29  34  31      22   6        4  17  26  53      34   6       34   8  21  37 
      7        6  28  35  31           7        3  17  27  53           7       27   8  22  43 
      8        4  22  35  39           8        3  18  22  57           8       19  10  22  49 

 11   6        3  19  21  57      23   6       25   9   6  60      35   6       21  11  45  23 
      7        3  16  19  62           7       22   8   4  66           7       23  12  40  25 
      8        3  15  16  66           8       21  10   6  63           8        8  13  12  67 

 12   6        5   8   8  79      24   6        7   7  48  37      36   6       14  16  11  59 
      7        3   9   6  82           7        5  11  47  37           7       19  11  11  59 
      8        6   6   7  81           8        5  11  43  41           8        8  13  12  67 

*   324 students 
**  330 students 
*** 344 students 

Table 10 
Item Statistics 

Item/Test Item Difficulty 
Item Correlations Index 



1 


.212 


1.338 


2 


.223 


2.531 


3 


.190 


1.961 


4 


.151 


1.387 


5 


JO 


2.183 


6 


.359 


2.153 


7 


326 


2.021 


8 


.224 


2.063 


9 


316 


1.998 


10 


229 


1.982 


11 


.198 


2396 


12 


266 


2.632 


13 


.185 


2.277 


14 


-.010 


1.469 


15 


222 


1.766 


16 


26i 


2.414 


17 


JOSS 


1.781 


18 


.220 


2.053 
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Item/Test Item DifTiculty 
Item Correlations Index 



19 


.179 


1.943 


20 


.281 


2.101 


21 


.388 


2.241 


22 


371 


2.302 


23 


.162 


2.078 


24 


.295 


2.369 


25 


.403 


2.501 


26 


.172 


1.825 


27 


.195 


1.825 


28 


325 




29 


.427 


2.072 


30 


•JOO 


1.906 


31 


.478 


2.145 


32 


.303 


1.816 


33 


.451 


2.132 


34 


.334 


1.806 


35 


.357 


1.736 


36 


.48^ 


2.199 
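
The two statistics reported in Table 10 can be illustrated with a short sketch. It assumes, since the surrounding text does not say so explicitly, that the difficulty index is the mean item score on the 0 to 3 scale and that the item/test correlation is an uncorrected Pearson correlation between item score and total score. The function names and the demonstration data are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)


def item_statistics(scores):
    """scores: one row per student, one 0-3 item score per column.

    Returns (item number, item/test correlation, mean item score) per item.
    Assumption: the total includes the item itself (uncorrected correlation).
    """
    totals = [sum(row) for row in scores]
    stats = []
    for j in range(len(scores[0])):
        item = [row[j] for row in scores]
        stats.append((j + 1, pearson(item, totals), sum(item) / len(item)))
    return stats


# Hypothetical data: 4 students, 3 items (not taken from the study).
demo_scores = [[3, 2, 0], [1, 1, 0], [2, 3, 1], [0, 0, 3]]
demo_stats = item_statistics(demo_scores)
```

Under these assumptions, a difficulty index near 3 marks an item on which most students gave complete inference answers, and a correlation near zero (as for item 14) marks an item that does not track the total score.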




Table 11

Percentage of All Responses Receiving Scores of 0, 1, 2, and 3 by Grade Level and Story

             UFOs               Money              Newspapers
Grade     0   1   2   3      0   1   2   3      0   1   2   3

6        14  22  18  46     16  17  19  48     21   9  22  48
7        12  20  16  52     13  18  19  50     21   8  21  50
8        13  20  16  53     12  17  19  52     16   8  23  53

Table 12

Distribution of Total Test Score Under Assumption of Random Response

Total    Cum            Total    Cum            Total    Cum
Score    %              Score    %              Score    %

 1       2.12E-20       37         .667          73      99.8
 2       1.49E-17       38        1.01           74      99.9
 3       1.94E-16       39        1.50           75      99.9
 4       1.93E-15       40        2.18           76      100
 5       [illegible]    41        3.10           77      100
 6       [illegible]    42        4.31           78      100
 7       0.75E-13       43        5.88           79      100
 8       [illegible]    44        7.85           80      100
 9       [illegible]    45       10.2            81      100
10       [illegible]    46       13.2            82      100
11       ?.44E-10       47       16.7            83      100
12       1.34E-9        48       20.7            84      100
13       [illegible]    49       25.2            85      100
14       [illegible]    50       30.2            86      100
15       [illegible]    51       [illegible]     87      100
16       [illegible]    52       41.2            88      100
17       [illegible]    53       47.0            89      100
18       [illegible]    54       53.0            90      100
19       [illegible]    55       58.8            91      100
20       [illegible]    56       64.5            92      100
21       [illegible]    57       69.8            93      100
22       [illegible]    58       74.8            94      100
23       [illegible]    59       79.3            95      100
24       [illegible]    60       83.3            96      100
25       [illegible]    61       87.0            97      100
26       [illegible]    62       90.0            98      100
27       2.65E-3        63       92.0            99      100
28       5.21E-3        64       94.0           100      100
29       9.92E-3        65       95.7           101      100
30       .018           66       96.9           102      100
31       .033           67       97.8           103      100
32       .058           68       98.4           104      100
33       .100           69       98.9           105      100
34       .166           70       99.3           106      100
35       .271           71       99.6           107      100
36       .430           72       99.7           108      100
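
A distribution of this kind can be computed directly. The following is a minimal sketch that assumes a random responder is equally likely to receive a score of 0, 1, 2, or 3 on each of the 36 items, independently across items; the report does not spell out the exact model, so this is an assumption, and the function name is hypothetical.

```python
def random_response_cumulative(n_items=36, max_score=3):
    """Cumulative percentage of each total score under random response.

    Assumption: every item score 0..max_score is equally likely and items
    are independent. Returns a list indexed by total score (0..108 for the
    default arguments).
    """
    # counts[t] = number of equally likely response patterns with total t
    counts = [1]
    for _ in range(n_items):
        new_counts = [0] * (len(counts) + max_score)
        for total, ways in enumerate(counts):
            for score in range(max_score + 1):
                new_counts[total + score] += ways
        counts = new_counts

    n_patterns = (max_score + 1) ** n_items
    cumulative = []
    running = 0
    for ways in counts:
        running += ways
        # Exact integer counts are divided only at the end.
        cumulative.append(100 * running / n_patterns)
    return cumulative


cum_pct = random_response_cumulative()
```

Under this assumption `cum_pct[2]` is about 1.49E-17 and `cum_pct[54]` is about 53.0, in line with the corresponding legible entries of the table, which suggests (without confirming) that the table was built on a similar model.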



Table 13

Means, Standard Deviations, and KR-20 Reliabilities for Story and Total Test

                                             Standard
Story                           Mean         Deviation     KR-20

UFOs (items 1-12)               24.65         5.37         [missing]
Money (items 13-24)             24.59         4.99         .49
Newspapers (items 25-36)        24.23         7.31         .77
Total Test (items 1-36)         73.48        13.50         .79
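
For readers who wish to compute a reliability of this kind, the sketch below may help. Because TIA items are scored 0 to 3 rather than right or wrong, it uses coefficient alpha, the general form that reduces to KR-20 when every item is dichotomous. Whether the report's KR-20 values were computed in exactly this way is an assumption, and the function name and demonstration data are hypothetical.

```python
def coefficient_alpha(scores):
    """Coefficient alpha for a students-by-items score matrix.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance of totals)
    Population variances are used throughout; for 0/1 items the item
    variance is p * q, so the result equals KR-20.
    """
    k = len(scores[0])
    if k < 2:
        raise ValueError("at least two items are required")

    def pvariance(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / len(values)

    item_variances = [pvariance([row[j] for row in scores]) for j in range(k)]
    total_variance = pvariance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("reliability is undefined when all totals are equal")
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical data (not from the study).
alpha_dichotomous = coefficient_alpha([[1, 0], [1, 1], [0, 0], [1, 1]])
alpha_identical = coefficient_alpha([[3, 3], [0, 0], [2, 2]])
```

Two items that always receive the same score give a coefficient of 1, while items that vary independently of one another drive the coefficient toward 0.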



Appendix A 

Reading Rating Scale for Test
of Inference Ability in Reading Comprehension

The reading score is based upon the quality of the interpretation given by students.

TIA contains 36 items. For each item a student is assigned a score of 0, 1, 2, or 3 for a possible total of
108 points if a student made complete inferences on all items. For each item, reading will be rated
according to the following scale.

Rating    TIA Reading Evaluation

3         The student integrates relevant text information and relevant background
          knowledge to construct complete interpretations that are consistent with both
          the text information and background knowledge. Thus, the student has given a
          complete inference answer.

2         The student integrates some text information and background knowledge but
          fails to take into account the available relevant information. The student's
          answer is consistent with some relevant text information and background
          knowledge but is incomplete. Thus, the student has given a partially-correct
          inference answer.

1         The student locates relevant text information but fails to integrate it with
          relevant background knowledge. Thus, the student has given a non-inference
          answer.

0         The student makes an inconsistent use of the text information and background
          knowledge. Thus, the student has given an implausible answer.
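
The scoring arithmetic implied by this scale can be sketched as follows. The story-to-item assignment is taken from Table 13; the function name, the label dictionary, and the choice to report totals per story are illustrative assumptions rather than part of the published procedure.

```python
# Labels for the four scale points, as described in the scale above.
READING_LABELS = {
    3: "complete inference",
    2: "partially-correct inference",
    1: "non-inference",
    0: "implausible answer",
}

# Item ranges per story, following Table 13 (items are numbered 1-36).
STORY_ITEMS = {
    "UFOs": range(1, 13),
    "Money": range(13, 25),
    "Newspapers": range(25, 37),
}


def reading_totals(ratings):
    """ratings: list of 36 item ratings (each 0-3), in item order.

    Returns a dict with one total per story plus the total test score
    (maximum 108).
    """
    if len(ratings) != 36:
        raise ValueError("TIA has 36 items")
    if any(r not in READING_LABELS for r in ratings):
        raise ValueError("each rating must be 0, 1, 2, or 3")
    totals = {
        story: sum(ratings[i - 1] for i in items)
        for story, items in STORY_ITEMS.items()
    }
    totals["Total Test"] = sum(ratings)
    return totals


# A hypothetical student who gives complete inference answers throughout.
perfect = reading_totals([3] * 36)
```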

Each of the scale graduations is exemplified in the subsequent section. Examples are taken from
selected items on TIA. Test item stems are provided with student answers in bold. Evaluator's
comments are also provided.

3 points (complete inference)

The student integrates relevant text information and background knowledge to construct complete
interpretations that are consistent with the text information and background knowledge. Students
substantiate their answers with relevant evidence from both text information and background
knowledge in a logical and coherent manner. Examples include the following:

(1) "Using available information people learn the most about UFOs by combining the information in
all the reports." In this example, the student recognizes the story says that scientists use three kinds of
reports to study UFOs. The most information about UFOs would be attained by pooling all three, thus
the preceding answer is consistent and complete with both text information and background
knowledge.

(2) "Increased evidence is available today about UFOs than years ago because we have more scientific
equipment to study UFOs." In this example, the student uses the text information about potential use
of weather cameras in satellites and increased knowledge of the universe to compare the present to the
past. This information is then used to further reason that scientific equipment is an important factor in
learning more about UFOs, thus the preceding answer is consistent and complete with both text
information and background knowledge.

2 points (partially-correct inference)

The student gives an answer that indicates an integration of some text information and background
knowledge but fails to take into account available, relevant information. The answer is consistent with
some text information and relevant background knowledge, but is incomplete. Examples of
partially-complete inference answers follow:

(1) "UFOs are sometimes called other names because people don't know what to call them so they
name them by shape." In this example, the student reasoned from some relevant text information but
did not provide alternate interpretations. The student did not appear to monitor for consistency and
completeness with available text evidence, that is, UFOs are called other names because of their
probable origin, thus the preceding answer is only partially-correct.

(2) "Something unidentified in the sky may be called a UFO because that is what people call it when
they jump to conclusions." In this example, the student reasoned from some relevant text information
but did not remain tentative. The preceding answer is a case where such a statement may be true, but
it does not represent a complete interpretation of available, relevant text information pertaining to the
question posed.

(3) "It is not known where UFOs come from, it seems they could be from Earth because we have the
materials and people to build such craft." In this example, the student reasoned from an unwarranted
assumption to a justifiable alternative interpretation. The student has constructed an interpretation,
but overlooks the fact that we do not know what UFOs are, so how could we construct them? On the
other hand, the student may be thinking that some UFOs are misnamed spacecraft from Earth, which
makes the answer partially-correct.

1 point (non-inference)

It may be that the student did not understand the task, or that the student is more accustomed to
non-inferential questions which often require the mere location of information than to inference questions
which require the integration of relevant text information and background knowledge. In the latter
case students give an answer directly related to the text, or an answer which reflects minimal
substantiation with the text evidence. Examples of non-inference answers follow:

(1) "UFO stories are very different from each other because people sometimes think things like
weather satellites, clouds, and bright stars are UFOs." In this example, the answer given by the
student is directly from the UFOs story.

(2) "It is not known where UFOs come from, it seems they couldn't be from almost anywhere because
the story said they came from other planets." In this example, the student seemed to forego other
possible answers for the sake of specific text information.

0 point (implausible answer)

It may be that the student did not understand the task, or that the student's answer is unsubstantiated.
Examples of implausible answers follow:

(1) "[illegible] people [illegible] unidentified flying objects in the sky." In this example, the
student's answer is circular; it does not answer the item.

(2) "Some people think the study of UFOs should be continued because some scientists think UFOs
are not real." In this example, the student makes inappropriate use of text information. The text says,
"In 1969 one group of scientists concluded that there was not enough evidence to prove that UFOs are
real and that UFOs are not worth further study." The answer given is implausible.

(3) "Something unidentified in the sky may be called a UFO because that is the shape of whatever it is
in the sky." In this example, the student was vague. The answer is unclear because it does not say what
the shape is, nor specify what it is that is in the sky.

(4) "UFO stories are very different from each other because people tend to exaggerate what they see
and think of different names for UFOs." The student did not take into account available text
information which says, "The weather, the time of day, and the number of people watching UFOs may
make the stories different" in formulating an answer.

Appendix B

Thinking Rating Scale for Phillips-Patterson Test
of Inference Ability in Reading Comprehension

One of the variables derived to appraise the quality of TIA is a thinking score. The score is based upon
an analysis of verbal report interviews as students cited why they chose a particular answer as the best
answer on the test.

TIA contains 3 stories each with 12 items. Students were asked to think aloud on 1 of the 3 stories.
For each item a student's thinking will be rated between 0 and 3 for a total of 36 points if a student
thought well on all 12 items. For each item, thinking will be rated according to the following scale.

Rating    TIA Thinking Evaluation

3         The student cites all relevant textual and background information in the
          explanation of an answer choice. That is, the student considers the question and
          the available textual and background information pertinent to it in the
          formulation of a response which is complete and consistent.

2         The student cites some of the relevant textual and background information in the
          explanation. That is, the student considers either a part of the question and the
          available textual and background information pertinent to it, or the student
          considers the question and part of the available information pertinent to it in the
          formulation of a response which is consistent but incomplete.

1         The student cites insufficient relevant textual and background information in the
          formulation of a response. That is, the student's response is not sufficient to
          indicate a clear understanding of either the question or the story. It is in need of
          elaboration and contains information which is partially correct and partially
          erroneous. However, it does reflect minimal integration of relevant information.

0         The student cites irrelevant or erroneous or repeated textual, background
          information, or both in the formulation of a response. That is, the student either
          misunderstands, misconstrues the story, or repeats the selected answer or textual
          information with no interpretation.
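
Because each verbal report yields a thinking rating alongside the reading rating for the answer chosen, the two can be compared item by item. The sketch below cross-tabulates the two ratings for the 12 items of one story and totals the thinking score (maximum 36). The function name, the layout of the table, and the sample ratings are hypothetical and are not data from the study.

```python
def thinking_vs_reading(reading, thinking):
    """Cross-tabulate reading and thinking ratings for one story.

    reading, thinking: equal-length lists of 0-3 ratings, one pair per item.
    Returns (table, thinking_total, agreements), where table[t][r] counts the
    items rated t for thinking and r for reading.
    """
    if len(reading) != len(thinking):
        raise ValueError("each item needs one reading and one thinking rating")
    if any(x not in (0, 1, 2, 3) for x in list(reading) + list(thinking)):
        raise ValueError("each rating must be 0, 1, 2, or 3")

    table = [[0] * 4 for _ in range(4)]
    for r, t in zip(reading, thinking):
        table[t][r] += 1

    thinking_total = sum(thinking)
    # Items on which the thinking rating equals the reading rating.
    agreements = sum(table[level][level] for level in range(4))
    return table, thinking_total, agreements


# Hypothetical ratings for the 12 items of a single story.
sample_reading = [3, 2, 1, 0, 3, 3, 2, 2, 1, 0, 3, 2]
sample_thinking = [3, 2, 1, 0, 3, 2, 2, 1, 1, 0, 3, 3]
sample_table, sample_total, sample_agree = thinking_vs_reading(
    sample_reading, sample_thinking
)
```

Counts concentrated on the diagonal of the table would indicate that good thinking tends to accompany good inference answers, and poor thinking poor ones.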

It is important to be cognizant of at least two precautions in rating students' thinking. The first is that
this scale is fundamentally a measure of reasoning ability, not expressive ability. The goal is to focus on
the quality of thinking in the verbal report interviews, rather than on the quality of effective speech.

The second precaution concerns the context of student justifications of answer choice on TIA. When
students give a justification or what is intended to be a justification they do so having made an answer
choice, so the story, the item stem, and the answer selected by the student create the context for the
verbal report interview. In other words, when students tell why they selected a particular answer to be
the "best," the context of a student response must be used in rating the quality of student thinking.

Each of the scale graduations is exemplified in the subsequent section. Examples are taken from
student verbal report interviews for each of the 3 stories on TIA. Test item stems are in bold and
student "best" answer choices complete the test item. Student interview comments are then presented,
followed by an evaluator's comments.

Item 1 on the UFO story is evaluated as follows:

3 points



UFOs are sometimes called other names because people name them according to their shape or
probable origin.

Student says, "I think that's the best answer because we do not really know what they (UFOs) are.
UFOs are called 'flying saucers' and 'spaceships from other planets' and other names in the story, so
since we don't really know what they are or where they come from, or if they come from anywhere,
then people give them names that kinda makes sense according to their shape or where they might be
from."

Shape and origin are the factors which must be inferred by a student from the textual information and
background information in order to best explain why UFOs might be called other names.

2 points

UFOs are sometimes called other names because people don't know what to call them so name them
by shape.

Student says, "I think that's the best answer because people don't know what to call them so they
exaggerate about the shape, cause like they call them 'flying saucers'."

Shape was only one of the factors to be inferred about why UFOs might be called other names; the
student's reported thinking reflects the use of some relevant textual and background information. The
student did not incorporate the textual information "spaceships from other planets" and "extraterrestrial
spacecraft" into reasoning for the best answer.

1 point

UFOs are sometimes called other names because people see an area with many coloured lights in the
sky.

Student says, "I think that's the best answer cause it says it in the story" (student points to text).

An interviewer says, "So, why are UFOs sometimes called other names?" and student responds by
reading from the text:

"Stories have been told that UFOs light up an area with coloured lights and that creatures of different
sizes and colours have been seen in them."

In this instance, the student failed to make the intended inference in response to the question but does
seem to indicate awareness of related textual information. The selected response and reasoning are
examples of insufficient use of relevant textual and background information.

0 point



UFOs are sometimes called other names because people know they are unidentified flying objects in
the sky.

Student says, "It's strange to see unidentified flying objects in the sky and mostly they are UFOs, which
is a name they gave them."

In this instance, the student failed either to answer the question or to provide quality reasoning; indeed
the student response begs the question.

Item 9 on the Money story is evaluated as follows:

3 points

The money system is different from the trade system because money gave things a standard unit
whereas trade did not.

Student says, "I think that's the best answer because in the old days people could never be sure they got
the same exchange value for their trade products. Like a cow might not really be the same value as say
a piece of land but if one fellow wanted the land and the other fellow the cow, then they exchanged
because they needed it even though the trades might not be fair. Money is better because say if a cow
costs three hundred dollars, then people know if that is a good price, three hundred dollars is three
hundred dollars no matter what you're buying, and with money there's change. Another thing is that
cows might be worth different amounts."

To successfully answer this question a student must reason as to the differences between the money
and trade systems by using both textual and background information, which the above student has
done.

2 points

The money system is different from the trade system because one animal could be worth more in
trading than the other.

Student says, "because one cow could be a good healthy cow giving milk and stuff but another cow
might be old and sick, so they wouldn't be a fair trade for a piece of land. The cows and the land
wouldn't be worth the same."

This response addressed only part of the test item. The student used textual and background
information to cite the inequities of the trade system but did not speak to the money system.

1 point

The money system is different from the trade system because it might cost you ten cows for a piece of
land in trading.

Student says, "because that's what it says up in the story." Upon further enquiry as to why the two
systems are different, the student does not add further clarification.

This student response does not represent an integration of the textual with background information;
the student merely offered related information without any indication of having reasoned through the
question.

0 point

The money system is different from the trade system because money is what people use all the time so
it is newer.

Student says, "Everybody knows what money is and how to use it."

In this particular instance, the student used background information to respond with an irrelevant
statement which does not answer the item.

Item 3 on The Wrong Newspapers story is evaluated as follows:

3 points

Yesterday's paper was still lying in the puddle because something odd has happened.

Student says, "it seems from the story that Ann knew Mr. Jones was home and it was peculiar that he
didn't pick up his paper. It couldn't have been the rain because Ann was able to deliver her papers, so
something must have happened. Anyway the title is a clue that something strange has happened."

In this item, the student must incorporate the textual and background information to answer why the
paper might be lying in the puddle.

2 points

Yesterday's paper was still lying in the puddle because the weather was too wet.

Student says, "more than likely it was too wet for Mr. Jones to come out to get it, so it was still in the
puddle."

In this example, the student used only some of the relevant information to formulate a justification, and
appears to have relied more upon background than upon an incorporation of both textual and
background information.

1 point

Yesterday's paper was still lying in the puddle because it was in a plastic bag.

Student says, "It says in the story that's how Ann left the paper."

In this instance, the student did not incorporate the relevant textual and background information but
the response indicates an awareness of related textual information.

0 point

Yesterday's paper was still lying in the puddle because it should have been picked up.

Student says, "because she knew that Mr. Jones was away and she knew that there was someone in the
house and the paper should have been picked up."

In this case the student offered a subjective response, misread the textual information, and failed to
answer the item.